-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-121285: Remove backtracking when parsing tarfile headers #121286
Changes from 2 commits
abb1b03
647f4ef
0b5341b
9a6cc59
39a419c
4d04f3b
3477a0a
94dd198
add30cd
6ca6329
b5c4511
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1422,7 +1422,7 @@ def _proc_pax(self, tarfile): | |
POSIX.1-2008. | ||
""" | ||
# Read the header information. | ||
buf = tarfile.fileobj.read(self._block(self.size)) | ||
buf: bytes = tarfile.fileobj.read(self._block(self.size)) | ||
|
||
# A pax header stores supplemental information for either | ||
# the following file (extended) or all following files | ||
|
@@ -1437,9 +1437,15 @@ def _proc_pax(self, tarfile): | |
# these fields are UTF-8 encoded but since POSIX.1-2008 tar | ||
# implementations are allowed to store them as raw binary strings if | ||
# the translation to UTF-8 fails. | ||
match = re.search(br"\d+ hdrcharset=([^\n]+)\n", buf) | ||
if match is not None: | ||
pax_headers["hdrcharset"] = match.group(1).decode("utf-8") | ||
if ( | ||
# Statement is both a contains check (!=-1) and a bounds check (>0) | ||
(hdrcharset_offset := buf.find(b" hdrcharset=") - 1) > -1 | ||
# Check that the character before is a digit (0x30-0x39 is 0-9) | ||
and 0x30 <= buf[hdrcharset_offset] <= 0x39 | ||
): | ||
match = re.match(br"^\d{1,20} hdrcharset=([^\n]+)\n", buf[hdrcharset_offset:]) | ||
if match is not None: | ||
pax_headers["hdrcharset"] = match.group(1).decode("utf-8") | ||
|
||
# For the time being, we don't care about anything other than "BINARY". | ||
# The only other value that is currently allowed by the standard is | ||
|
@@ -1454,14 +1460,14 @@ def _proc_pax(self, tarfile): | |
# "%d %s=%s\n" % (length, keyword, value). length is the size | ||
# of the complete record including the length field itself and | ||
# the newline. keyword and value are both UTF-8 encoded strings. | ||
regex = re.compile(br"(\d+) ([^=]+)=") | ||
regex = re.compile(br"^(\d{1,20}) ([^=]+)=") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can use possessive qualifiers There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if sticking with over-reliance on regular expressions in 3.11 and later, yes, true (thanks for pointing out the modern feature!). Though I don't think we should do that. See other comments. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Non-greedy qualifier has almost the same effect here -- removes backtracking. We can use new feature in new versions and older features in older versions. |
||
pos = 0 | ||
while match := regex.match(buf, pos): | ||
while match := regex.match(buf[pos:]): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is this change? It makes the complexity quadratic, because at every iteration you make a copy of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe this was consequence of the regardless, it does have that downside. I didn't dive into the Regardless, this isn't great. I think what I would prefer instead of a single regex to attempt to parse these headers is more to follow the intent of the format: extract leading digits from buf@pos until a non-digit is encountered. error if that is not a space, error if the length is nonsense. error if the remaining data cannot satisfy length. partition the remaining data per the length on off the top of my head in this comment, the logic would probably look something like: # at module scope:
_length_prefix_re = re.compile(rb'(\d{1,20}) ')`
# within the while loop:
if not match := _length_prefix_re.match(buf, pos):
raise InvalidHeaderError(...)
try:
length = int(match.group(1))
except ValueError:
raise InvalidHeaderError(...)
if length <= 3:
raise InvalidHeaderError(...)
kv = buf[match.end(1) + 1:match.start(1) + length - 1] # is this right?
keyword, equals, value = kv.partition(b'=')
if not keyword or equals != b'=' or ... ensure no invalid chars in keyword or value ...:
raise InvalidHeaderError(...)
... decode_pax_field stuff ... # but see other comment, delay this until after the loop
pax_headers[keyword] = value
... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's the opposite. Actually, the code may be simplified/optimized even more by using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The usage of the index as a parameter and splicing the buffer do act different:
the results from the print statements differ despite the regex having a carrot at the start. If the regex is changed to have a carrot at the start, then some tests will fail. With this buffer splicing all tests pass, but it is quite inefficient. |
||
length, keyword = match.groups() | ||
length = int(length) | ||
if length == 0: | ||
raise InvalidHeaderError("invalid header") | ||
value = buf[match.end(2) + 1:match.start(1) + length - 1] | ||
value = buf[match.end(2) + pos + 1:match.start(1) + pos + length - 1] | ||
|
||
# Normally, we could just use "utf-8" as the encoding and "strict" | ||
# as the error handler, but we better not take the risk. For | ||
|
@@ -1520,12 +1526,26 @@ def _proc_pax(self, tarfile): | |
def _proc_gnusparse_00(self, next, pax_headers, buf): | ||
"""Process a GNU tar extended sparse header, version 0.0. | ||
""" | ||
offsets = [] | ||
for match in re.finditer(br"\d+ GNU.sparse.offset=(\d+)\n", buf): | ||
offsets.append(int(match.group(1))) | ||
numbytes = [] | ||
for match in re.finditer(br"\d+ GNU.sparse.numbytes=(\d+)\n", buf): | ||
numbytes.append(int(match.group(1))) | ||
def finditer_without_backtracking(buf, needle: bytes) -> list[int]: | ||
values = [] | ||
regex = re.compile(br"^\d{1,20}%s(\d+)\n" % (needle,)) | ||
while True: | ||
if ( | ||
# Statement is both a contains check (!=-1) and a bounds check (>0) | ||
(needle_offset := buf.find(needle) - 1) > -1 | ||
# Check that the character before is a digit (0x30-0x39 is 0-9) | ||
and 0x30 <= buf[needle_offset] <= 0x39 | ||
): | ||
match = regex.match(buf[needle_offset:]) | ||
if match is not None: | ||
values.append(int(match.group(1))) | ||
buf = buf[needle_offset + len(needle):] # Skip over to the match. | ||
else: | ||
break | ||
return values | ||
|
||
offsets = finditer_without_backtracking(buf, b" GNU.sparse.offset=") | ||
numbytes = finditer_without_backtracking(buf, b" GNU.sparse.numbytes=") | ||
next.sparse = list(zip(offsets, numbytes)) | ||
|
||
def _proc_gnusparse_01(self, next, pax_headers): | ||
|
sethmlarson marked this conversation as resolved.
Show resolved
Hide resolved
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
Remove backtracking from tarfile header parsing for ``hdrcharset``, PAX, and | ||
GNU sparse headers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simply use
br"\d hdrcharset=(.++)\n"
orbr"\d hdrcharset=(.+?)\n"
. Evenbr"\d hdrcharset=(.+)\n"
has linear complexity.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The length prefix of hdrcharset is more than a single digit. I suppose you are suggesting that since it is being ignored here we don't need to match all of its digits and can just use
\d
?The non-backtracking (++ et al) feature is new in 3.11 so won't work for older security backports. Since we need to support those it is nice to keep the same code between releases. Which is one reason why this PR headed in the direction away from over-use of regular expressions for header parsing.
It is also a bug for us to not be validating the length value on a hdrcharset record... (in both the old code and this new code). That could lead to unintended parsing consequences where we interpret a file different than other implementations would.
I think it would be better to restructure this parsing a bit. Instead of pre-searching for a possible hdrcharset= field within the header before the loop below, we should defer knowledge of and use of
encoding
until after the while loop extracting keyword=value pairs below is complete. Only then check for one being hdrcharset and use that as the encoding for._decode_pax_field
to process all of the others. That'd treat this as far less of a special case and focus pax header parsing in that one single loop.The format specification appears to be: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gpshead I've implemented your suggestion (and added more framing tests) here: 39a419c Please take a look and let me know what you think.