
gh-121285: Remove backtracking when parsing tarfile headers #121286

Merged (11 commits) on Aug 31, 2024
46 changes: 33 additions & 13 deletions Lib/tarfile.py
@@ -1422,7 +1422,7 @@ def _proc_pax(self, tarfile):
        POSIX.1-2008.
        """
        # Read the header information.
        buf = tarfile.fileobj.read(self._block(self.size))
        buf: bytes = tarfile.fileobj.read(self._block(self.size))

        # A pax header stores supplemental information for either
        # the following file (extended) or all following files
@@ -1437,9 +1437,15 @@ def _proc_pax(self, tarfile):
        # these fields are UTF-8 encoded but since POSIX.1-2008 tar
        # implementations are allowed to store them as raw binary strings if
        # the translation to UTF-8 fails.
        match = re.search(br"\d+ hdrcharset=([^\n]+)\n", buf)
        if match is not None:
            pax_headers["hdrcharset"] = match.group(1).decode("utf-8")
        if (
            # Statement is both a contains check (!=-1) and a bounds check (>0)
            (hdrcharset_offset := buf.find(b" hdrcharset=") - 1) > -1
            # Check that the character before is a digit (0x30-0x39 is 0-9)
            and 0x30 <= buf[hdrcharset_offset] <= 0x39
        ):
            match = re.match(br"^\d{1,20} hdrcharset=([^\n]+)\n", buf[hdrcharset_offset:])
            if match is not None:
                pax_headers["hdrcharset"] = match.group(1).decode("utf-8")
@serhiy-storchaka (Member) commented on Jul 5, 2024:
You can simply use br"\d hdrcharset=(.++)\n" or br"\d hdrcharset=(.+?)\n". Even br"\d hdrcharset=(.+)\n" has linear complexity.
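To illustrate the quantifier styles being discussed, here is a small sketch of my own (not code from the PR) showing both patterns on a sample header record; note that possessive quantifiers require Python 3.11+:

```python
import re

buf = b"21 hdrcharset=BINARY\n"

# Lazy quantifier: available on all supported versions; each character
# is tried at most once, so there is no pathological backtracking.
lazy = re.compile(br"\d hdrcharset=(.+?)\n")
print(lazy.search(buf).group(1))  # b'BINARY'

# Possessive quantifier (Python 3.11+): ".++" never gives back what it
# has consumed, so the engine cannot backtrack at all.
try:
    possessive = re.compile(br"\d hdrcharset=(.++)\n")
    print(possessive.search(buf).group(1))  # b'BINARY'
except re.error:
    pass  # possessive quantifiers are unavailable before 3.11
```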

A Member replied:

The length prefix of hdrcharset is more than a single digit. I suppose you are suggesting that since it is being ignored here we don't need to match all of its digits and can just use \d?

The non-backtracking (++ et al) feature is new in 3.11 so won't work for older security backports. Since we need to support those it is nice to keep the same code between releases. Which is one reason why this PR headed in the direction away from over-use of regular expressions for header parsing.

It is also a bug for us to not be validating the length value on a hdrcharset record... (in both the old code and this new code). That could lead to unintended parsing consequences where we interpret a file differently than other implementations would.

I think it would be better to restructure this parsing a bit. Instead of pre-searching for a possible hdrcharset= field within the header before the loop below, we should defer knowledge of and use of encoding until after the while loop extracting keyword=value pairs below is complete. Only then check for one being hdrcharset and use that as the encoding for ._decode_pax_field to process all of the others. That'd treat this as far less of a special case and focus pax header parsing in that one single loop.

The format specification appears to be: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html
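A minimal sketch of the deferred-decoding idea described above (hypothetical helper name and encoding choice, not the code that landed): collect raw keyword/value byte pairs first, then choose the encoding from the collected hdrcharset field only after the loop completes:

```python
def decode_fields(raw_fields: dict) -> dict:
    """Decode raw pax keyword/value byte pairs after parsing completes."""
    hdrcharset = raw_fields.get(b"hdrcharset", b"")
    # "BINARY" means values were stored as raw bytes; otherwise the
    # standard says they are UTF-8.  latin-1 here is a stand-in for a
    # lossless byte-preserving decode (an assumption of this sketch).
    encoding = "latin-1" if hdrcharset == b"BINARY" else "utf-8"
    return {
        key.decode("utf-8"): value.decode(encoding, "surrogateescape")
        for key, value in raw_fields.items()
    }

print(decode_fields({b"hdrcharset": b"BINARY", b"path": b"some/\xfffile"}))
```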

The PR author (Contributor) replied:

@gpshead I've implemented your suggestion (and added more framing tests) here: 39a419c Please take a look and let me know what you think.


        # For the time being, we don't care about anything other than "BINARY".
        # The only other value that is currently allowed by the standard is
@@ -1454,14 +1460,14 @@ def _proc_pax(self, tarfile):
        # "%d %s=%s\n" % (length, keyword, value). length is the size
        # of the complete record including the length field itself and
        # the newline. keyword and value are both UTF-8 encoded strings.
        regex = re.compile(br"(\d+) ([^=]+)=")
        regex = re.compile(br"^(\d{1,20}) ([^=]+)=")
A Member commented:
You can use possessive qualifiers br"(\d++) ([^=]++)=".

A Member replied:

if sticking with over-reliance on regular expressions in 3.11 and later, yes, true (thanks for pointing out the modern feature!). Though I don't think we should do that. See other comments.

A Member replied:

Non-greedy qualifier has almost the same effect here -- removes backtracking.

We can use the new feature in new versions and older features in older versions.
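To make the difference concrete, here is an adversarial input of my own construction (not from the PR) comparing the unbounded pattern with the bounded, anchored one; the bounded pattern gives up after at most 20 digit lengths instead of scanning the entire run:

```python
import re

# A long run of digits with no "=" terminator anywhere.
bad = b"1" * 10_000 + b" keyword"

old = re.compile(br"(\d+) ([^=]+)=")        # unbounded digit run
new = re.compile(br"^(\d{1,20}) ([^=]+)=")  # at most 20 digits

print(old.match(bad))  # None, after scanning the whole digit run
print(new.match(bad))  # None, after trying at most 20 digit lengths
```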

        pos = 0
        while match := regex.match(buf, pos):
        while match := regex.match(buf[pos:]):
A Member commented:
Why is this change? It makes the complexity quadratic, because at every iteration you make a copy of buf.

A Member replied:

I believe this was a consequence of the ^-anchored regex, as regex.match(buf, pos) does not anchor the ^ to pos, but to the absolute start of buf.

regardless, it does have that downside. I didn't dive into the buf = tarfile.fileobj.read(self._block(self.size)) above, but expect that means the total amount of data is limited so the consequences of this copying would at least be constrained.

Regardless, this isn't great. I think what I would prefer instead of a single regex to attempt to parse these headers is more to follow the intent of the format:

extract leading digits from buf@pos until a non-digit is encountered. error if that is not a space, error if the length is nonsense. error if the remaining data cannot satisfy length. partition the remaining data per the length on = - error if the remainder started or ended with the = or no = was found.

off the top of my head in this comment, the logic would probably look something like:

# at module scope:
_length_prefix_re = re.compile(rb'(\d{1,20}) ')

# within the while loop:
if not (match := _length_prefix_re.match(buf, pos)):
    raise InvalidHeaderError(...)
try:
    length = int(match.group(1))
except ValueError:
    raise InvalidHeaderError(...)
if length <= 3:
    raise InvalidHeaderError(...)
kv = buf[match.end(1) + 1:match.start(1) + length - 1]  # is this right?
keyword, equals, value = kv.partition(b'=')
if not keyword or equals != b'=' or ... ensure no invalid chars in keyword or value ...:
    raise InvalidHeaderError(...)
... decode_pax_field stuff ...  # but see other comment, delay this until after the loop
pax_headers[keyword] = value
...
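The sketch above can be fleshed out into a runnable, self-contained version (my own rendition, with hypothetical error messages and bounds checks, not the PR's final code):

```python
import re

_length_prefix_re = re.compile(rb"(\d{1,20}) ")

def parse_pax_records(buf: bytes) -> dict:
    """Parse length-prefixed pax records ('%d %s=%s' + newline)."""
    records = {}
    pos = 0
    while pos < len(buf):
        match = _length_prefix_re.match(buf, pos)
        if not match:
            raise ValueError("invalid header")
        length = int(match.group(1))
        # The record must at least hold its own prefix plus "k=\n",
        # and must not run past the end of the buffer.
        if length <= match.end() - pos + 2 or pos + length > len(buf):
            raise ValueError("invalid header")
        if buf[pos + length - 1] != ord("\n"):
            raise ValueError("invalid header")
        kv = buf[match.end():pos + length - 1]
        keyword, equals, value = kv.partition(b"=")
        if not keyword or not equals:
            raise ValueError("invalid header")
        records[keyword] = value
        pos += length
    return records

print(parse_pax_records(b"21 hdrcharset=BINARY\n"))
```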

A Member replied:

It's the opposite. ^ matches at the beginning of the string or after \n (if the multiline flag is set). With match() and no starting index, ^ has no effect. So this change does not make sense.

Actually, the code may be simplified/optimized even more by using finditer().
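What the finditer() simplification might look like (a sketch under my own assumptions, not the PR's code), using a multiline-anchored, bounded pattern over sample GNU sparse records:

```python
import re

buf = b"26 GNU.sparse.offset=1024\n28 GNU.sparse.numbytes=2048\n"

# (?m)^ anchors each attempt to a record boundary and \d{1,20} bounds
# the length prefix, so no unbounded scanning can occur.
offsets = [
    int(m.group(1))
    for m in re.finditer(br"(?m)^\d{1,20} GNU\.sparse\.offset=(\d+)\n", buf)
]
print(offsets)  # [1024]
```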

A further reply:

Using the index as a parameter and slicing the buffer act differently:

import re
example_string = "\nabcdefg"
regex = re.compile("^abc")
ind = 1 # Some search index
# See the different results:
print(regex.match(example_string, ind))
print(regex.match(example_string[ind:]))

The results from the two print statements differ despite the regex having a caret at the start. If the caret is kept while using match(buf, pos), then some tests will fail. With this buffer slicing all tests pass, but it is quite inefficient.

            length, keyword = match.groups()
            length = int(length)
            if length == 0:
                raise InvalidHeaderError("invalid header")
            value = buf[match.end(2) + 1:match.start(1) + length - 1]
            value = buf[match.end(2) + pos + 1:match.start(1) + pos + length - 1]

            # Normally, we could just use "utf-8" as the encoding and "strict"
            # as the error handler, but we better not take the risk. For
@@ -1520,12 +1526,26 @@ def _proc_pax(self, tarfile):
    def _proc_gnusparse_00(self, next, pax_headers, buf):
        """Process a GNU tar extended sparse header, version 0.0.
        """
        offsets = []
        for match in re.finditer(br"\d+ GNU.sparse.offset=(\d+)\n", buf):
            offsets.append(int(match.group(1)))
        numbytes = []
        for match in re.finditer(br"\d+ GNU.sparse.numbytes=(\d+)\n", buf):
            numbytes.append(int(match.group(1)))
        def finditer_without_backtracking(buf, needle: bytes) -> list[int]:
            values = []
            regex = re.compile(br"^\d{1,20}%s(\d+)\n" % (needle,))
            while True:
                if (
                    # Statement is both a contains check (!=-1) and a bounds check (>0)
                    (needle_offset := buf.find(needle) - 1) > -1
                    # Check that the character before is a digit (0x30-0x39 is 0-9)
                    and 0x30 <= buf[needle_offset] <= 0x39
                ):
                    match = regex.match(buf[needle_offset:])
                    if match is not None:
                        values.append(int(match.group(1)))
                    buf = buf[needle_offset + len(needle):]  # Skip over to the match.
                else:
                    break
            return values

        offsets = finditer_without_backtracking(buf, b" GNU.sparse.offset=")
        numbytes = finditer_without_backtracking(buf, b" GNU.sparse.numbytes=")
        next.sparse = list(zip(offsets, numbytes))

    def _proc_gnusparse_01(self, next, pax_headers):
sethmlarson marked this conversation as resolved.
@@ -0,0 +1,2 @@
Remove backtracking from tarfile header parsing for ``hdrcharset``, PAX, and
GNU sparse headers.