Skip to content

gh-121999: Change default tarfile filter to 'data' #122002

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jul 26, 2024
12 changes: 8 additions & 4 deletions Doc/library/shutil.rst
Original file line number Diff line number Diff line change
Expand Up @@ -706,11 +706,9 @@ provided. They rely on the :mod:`zipfile` and :mod:`tarfile` modules.

The keyword-only *filter* argument is passed to the underlying unpacking
function. For zip files, *filter* is not accepted.
For tar files, it is recommended to set it to ``'data'``,
unless using features specific to tar and UNIX-like filesystems.
For tar files, it is recommended to use ``'data'`` (default since Python
3.14), unless using features specific to tar and UNIX-like filesystems.
(See :ref:`tarfile-extraction-filter` for details.)
The ``'data'`` filter will become the default for tar files
in Python 3.14.

.. audit-event:: shutil.unpack_archive filename,extract_dir,format shutil.unpack_archive

Expand All @@ -721,6 +719,12 @@ provided. They rely on the :mod:`zipfile` and :mod:`tarfile` modules.
the *extract_dir* argument, e.g. members that have absolute filenames
starting with "/" or filenames with two dots "..".

Since Python 3.14, the defaults for both built-in formats (zip and tar
files) will prevent the most dangerous of such security issues,
but will not prevent *all* unintended behavior.
Read the :ref:`tarfile-further-verification`
section for tar-specific details.

.. versionchanged:: 3.7
Accepts a :term:`path-like object` for *filename* and *extract_dir*.

Expand Down
75 changes: 47 additions & 28 deletions Doc/library/tarfile.rst
Original file line number Diff line number Diff line change
Expand Up @@ -40,9 +40,12 @@ Some facts and figures:
Archives are extracted using a :ref:`filter <tarfile-extraction-filter>`,
which makes it possible to either limit surprising/dangerous features,
or to acknowledge that they are expected and the archive is fully trusted.
By default, archives are fully trusted, but this default is deprecated
and slated to change in Python 3.14.

.. versionchanged:: 3.14
Set the default extraction filter to :func:`data <data_filter>`,
which disallows some dangerous features such as links to absolute paths
or paths outside of the destination. Previously, the filter strategy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about this one but the outside of now feels weird to me :')

@AA-Turner As a native speaker (you're the only one I know...), should it be "outside the destination", or "outside of"? (or something else entirely?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "outside of" is conventional in North America, less conventional in the UK.
https://learningenglish.voanews.com/a/should-we-think-outside-or-outside-of-the-box-/6434530.html

I think I was reading from PEP 706 when I wrote that update.

Refuse to extract links (hard or soft) which end up linking to a path outside of the destination. (On systems that don’t support links, tarfile will, in most cases, fall back to creating regular files. This proposal doesn’t change that behaviour.)

was equivalent to :func:`fully_trusted <fully_trusted_filter>`.

.. function:: open(name=None, mode='r', fileobj=None, bufsize=10240, **kwargs)

Expand Down Expand Up @@ -495,18 +498,18 @@ be finalized; only the internally used file object will be closed. See the
The *filter* argument specifies how ``members`` are modified or rejected
before extraction.
See :ref:`tarfile-extraction-filter` for details.
It is recommended to set this explicitly depending on which *tar* features
you need to support.
It is recommended to set this explicitly only if specific *tar* features
are required, or as ``filter='data'`` to support Python versions with a less
secure default (3.13 and lower).

.. warning::

Never extract archives from untrusted sources without prior inspection.
It is possible that files are created outside of *path*, e.g. members
that have absolute filenames starting with ``"/"`` or filenames with two
dots ``".."``.

Set ``filter='data'`` to prevent the most dangerous security issues,
and read the :ref:`tarfile-extraction-filter` section for details.
Since Python 3.14, the default (:func:`data <data_filter>`) will prevent
the most dangerous security issues.
However, it will not prevent *all* unintended or insecure behavior.
Read the :ref:`tarfile-extraction-filter` section for details.

.. versionchanged:: 3.5
Added the *numeric_owner* parameter.
Expand All @@ -517,6 +520,9 @@ be finalized; only the internally used file object will be closed. See the
.. versionchanged:: 3.12
Added the *filter* parameter.

.. versionchanged:: 3.14
The *filter* parameter now defaults to ``'data'``.


.. method:: TarFile.extract(member, path="", set_attrs=True, *, numeric_owner=False, filter=None)

Expand All @@ -536,10 +542,8 @@ be finalized; only the internally used file object will be closed. See the

.. warning::

See the warning for :meth:`extractall`.

Set ``filter='data'`` to prevent the most dangerous security issues,
and read the :ref:`tarfile-extraction-filter` section for details.
Never extract archives from untrusted sources without prior inspection.
See the warning for :meth:`extractall` for details.

.. versionchanged:: 3.2
Added the *set_attrs* parameter.
Expand Down Expand Up @@ -602,14 +606,8 @@ be finalized; only the internally used file object will be closed. See the
String names are not allowed for this attribute, unlike the *filter*
argument to :meth:`~TarFile.extract`.

If ``extraction_filter`` is ``None`` (the default),
calling an extraction method without a *filter* argument will raise a
``DeprecationWarning``,
and fall back to the :func:`fully_trusted <fully_trusted_filter>` filter,
whose dangerous behavior matches previous versions of Python.

In Python 3.14+, leaving ``extraction_filter=None`` will cause
extraction methods to use the :func:`data <data_filter>` filter by default.
If ``extraction_filter`` is ``None`` (the default), extraction methods
will use the :func:`data <data_filter>` filter by default.

The attribute may be set on instances or overridden in subclasses.
It also is possible to set it on the ``TarFile`` class itself to set a
Expand All @@ -619,6 +617,14 @@ be finalized; only the internally used file object will be closed. See the
To set a global default this way, a filter function needs to be wrapped in
:func:`staticmethod()` to prevent injection of a ``self`` argument.

.. versionchanged:: 3.14

The default filter is set to :func:`data <data_filter>`,
which disallows some dangerous features such as links to absolute paths
or paths outside of the destination.
Previously, the default was equivalent to
:func:`fully_trusted <fully_trusted_filter>`.

.. method:: TarFile.add(name, arcname=None, recursive=True, *, filter=None)

Add the file *name* to the archive. *name* may be any type of file
Expand Down Expand Up @@ -969,6 +975,12 @@ In most cases, the full functionality is not needed.
Therefore, *tarfile* supports extraction filters: a mechanism to limit
functionality, and thus mitigate some of the security issues.

.. warning::

None of the available filters blocks *all* dangerous archive features.
Never extract archives from untrusted sources without prior inspection.
See also :ref:`tarfile-further-verification`.

.. seealso::

:pep:`706`
Expand All @@ -992,12 +1004,13 @@ can be:

* ``None`` (default): Use :attr:`TarFile.extraction_filter`.

If that is also ``None`` (the default), raise a ``DeprecationWarning``,
and fall back to the ``'fully_trusted'`` filter, whose dangerous behavior
matches previous versions of Python.
If that is also ``None`` (the default), the ``'data'`` filter will be used.

.. versionchanged:: 3.14

In Python 3.14, the ``'data'`` filter will become the default instead.
It's possible to switch earlier; see :attr:`TarFile.extraction_filter`.
The default filter is set to :func:`data <data_filter>`.
Previously, the default was equivalent to
:func:`fully_trusted <fully_trusted_filter>`.

* A callable which will be called for each extracted member with a
:ref:`TarInfo <tarinfo-objects>` describing the member and the destination
Expand Down Expand Up @@ -1080,6 +1093,9 @@ reused in custom filters:

Return the modified ``TarInfo`` member.

Note that this filter does not block *all* dangerous archive features.
See :ref:`tarfile-further-verification` for details.


.. _tarfile-extraction-refuse:

Expand All @@ -1093,6 +1109,8 @@ With ``errorlevel=0`` the error will be logged and the member will be skipped,
but extraction will continue.


.. _tarfile-further-verification:

Hints for further verification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand All @@ -1110,9 +1128,10 @@ Here is an incomplete list of things to consider:
disk, memory and CPU usage.
* Check filenames against an allow-list of characters
(to filter out control characters, confusables, foreign path separators,
etc.).
and so on).
* Check that filenames have expected extensions (discouraging files that
execute when you “click on them”, or extension-less files like Windows special device names).
execute when you “click on them”, or extension-less files like Windows
special device names).
* Limit the number of extracted files, total size of extracted data,
filename length (including symlink length), and size of individual files.
* Check for files that would be shadowed on case-insensitive filesystems.
Expand Down
8 changes: 1 addition & 7 deletions Lib/tarfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -2248,13 +2248,7 @@ def _get_filter_function(self, filter):
if filter is None:
filter = self.extraction_filter
if filter is None:
import warnings
warnings.warn(
'Python 3.14 will, by default, filter extracted tar '
+ 'archives and reject files or modify their metadata. '
+ 'Use the filter argument to control this behavior.',
DeprecationWarning, stacklevel=3)
return fully_trusted_filter
return data_filter
if isinstance(filter, str):
raise TypeError(
'String names are not supported for '
Expand Down
3 changes: 0 additions & 3 deletions Lib/test/test_shutil.py
Original file line number Diff line number Diff line change
Expand Up @@ -2145,9 +2145,6 @@ def check_unpack_archive_with_converter(self, format, converter, **kwargs):
def check_unpack_tarball(self, format):
self.check_unpack_archive(format, filter='fully_trusted')
self.check_unpack_archive(format, filter='data')
with warnings_helper.check_warnings(
('Python 3.14', DeprecationWarning)):
self.check_unpack_archive(format)

def test_unpack_archive_tar(self):
self.check_unpack_tarball('tar')
Expand Down
52 changes: 18 additions & 34 deletions Lib/test/test_tarfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -722,6 +722,24 @@ def format_mtime(mtime):
tar.close()
os_helper.rmtree(DIR)

@staticmethod
def test_extractall_default_filter():
# Test that the default filter is now "data", and the other filter types are not used.
DIR = pathlib.Path(TEMPDIR) / "extractall_default_filter"
with (
os_helper.temp_dir(DIR),
tarfile.open(tarname, encoding="iso8859-1") as tar,
unittest.mock.patch("tarfile.data_filter", wraps=tarfile.data_filter) as mock_data_filter,
unittest.mock.patch("tarfile.tar_filter", wraps=tarfile.tar_filter) as mock_tar_filter,
unittest.mock.patch("tarfile.fully_trusted_filter", wraps=tarfile.fully_trusted_filter) as mock_ft_filter
):
directories = [t for t in tar if t.isdir()]
tar.extractall(DIR, directories)

mock_data_filter.assert_called()
mock_ft_filter.assert_not_called()
mock_tar_filter.assert_not_called()

@os_helper.skip_unless_working_chmod
def test_extract_directory(self):
dirtype = "ustar/dirtype"
Expand All @@ -738,31 +756,6 @@ def test_extract_directory(self):
finally:
os_helper.rmtree(DIR)

def test_deprecation_if_no_filter_passed_to_extractall(self):
DIR = pathlib.Path(TEMPDIR) / "extractall"
with (
os_helper.temp_dir(DIR),
tarfile.open(tarname, encoding="iso8859-1") as tar
):
directories = [t for t in tar if t.isdir()]
with self.assertWarnsRegex(DeprecationWarning, "Use the filter argument") as cm:
tar.extractall(DIR, directories)
# check that the stacklevel of the deprecation warning is correct:
self.assertEqual(cm.filename, __file__)

def test_deprecation_if_no_filter_passed_to_extract(self):
dirtype = "ustar/dirtype"
DIR = pathlib.Path(TEMPDIR) / "extractall"
with (
os_helper.temp_dir(DIR),
tarfile.open(tarname, encoding="iso8859-1") as tar
):
tarinfo = tar.getmember(dirtype)
with self.assertWarnsRegex(DeprecationWarning, "Use the filter argument") as cm:
tar.extract(tarinfo, path=DIR)
# check that the stacklevel of the deprecation warning is correct:
self.assertEqual(cm.filename, __file__)

def test_extractall_pathlike_dir(self):
DIR = os.path.join(TEMPDIR, "extractall")
with os_helper.temp_dir(DIR), \
Expand Down Expand Up @@ -4011,15 +4004,6 @@ def test_data_filter(self):
self.assertIs(filtered.name, tarinfo.name)
self.assertIs(filtered.type, tarinfo.type)

def test_default_filter_warns(self):
"""Ensure the default filter warns"""
with ArchiveMaker() as arc:
arc.add('foo')
with warnings_helper.check_warnings(
('Python 3.14', DeprecationWarning)):
with self.check_context(arc.open(), None):
self.expect_file('foo')

def test_change_default_filter_on_instance(self):
tar = tarfile.TarFile(tarname, 'r')
def strict_filter(tarinfo, path):
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
The default extraction filter for the :mod:`tarfile` module is now
set to :func:`'data' <tarfile.data_filter>`.
Loading