Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CVE-2023-27043] gh-102988: Reject malformed addresses in email.parseaddr() #111116

Merged
merged 14 commits into from
Dec 15, 2023
19 changes: 15 additions & 4 deletions Doc/library/email.utils.rst
Original file line number Diff line number Diff line change
Expand Up @@ -58,13 +58,18 @@ of the new API.
begins with angle brackets, they are stripped off.


.. function:: parseaddr(address)
.. function:: parseaddr(address, *, strict=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't strict=True make this is a backwards-incompatible change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind to elaborate? It's a keyword-only parameter. It sounds very unlikely that someone calls this function with strict=True since the parameter doesn't exist anymore. I don't see the addition of this parameter as a backward incompatible change.

The incompatible change is obviously the fact the emails accepted previously are now rejected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The incompatible change is obviously the fact the emails accepted previously are now rejected.

Yes.
If the default stayed strict=False, or to only warn by default on invalid addresses (for 2 releases), this would match the process in PEP 387. With strict=True, it seems that you'll need a SC exception for backwards incompatible behaviour -- and that can take time.

Or do you consider this a plain bug fix? In that case, why preserve the old behaviour?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change falls into the "unfortunately interesting" category of being unsure we're doing everything correctly so it provides an opt-out. If we have confidence in our change and test suite of valid and invalid email address coverage we could proceed without the behavior flag.

Do we have such confidence?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 0% confience that this change make the parser fully compatible with all RFC. I have 100% confidence that this change rejects email addresses which were accepted before.

Sorry, as I wrote before, the whole parser should be rewritten from scratch to use grammar suggested by RFC, rather than the current design.

This change only make the situation "less bad".

Also it's common that when Python becomes stricter, someone comes and explain their valid use case relying on the old behaviour.


Parse address -- which should be the value of some address-containing field such
as :mailheader:`To` or :mailheader:`Cc` -- into its constituent *realname* and
*email address* parts. Returns a tuple of that information, unless the parse
fails, in which case a 2-tuple of ``('', '')`` is returned.

If *strict* is true, use a strict parser which rejects malformed inputs.
vstinner marked this conversation as resolved.
Show resolved Hide resolved

.. versionchanged:: 3.13
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is backported, it should be 3.12.1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the version number should be updated in each backport. I removed the "needs backport" label to avoid automated backports.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alpha versions do not exist for common user, so I would use the 3.12.x version in which the fix will occur in the main branch too. It will help users that drop support of 3.11.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the concept but that depends on the backport situation. It is fair to start with 3.13 here and if we do decide to backport it further as a security fix later on, the docs can be updated to say 3.12.x.

Add *strict* optional parameter and reject malformed inputs by default.


.. function:: formataddr(pair, charset='utf-8')

Expand All @@ -82,12 +87,15 @@ of the new API.
Added the *charset* option.


.. function:: getaddresses(fieldvalues)
.. function:: getaddresses(fieldvalues, *, strict=True)

This method returns a list of 2-tuples of the form returned by ``parseaddr()``.
*fieldvalues* is a sequence of header field values as might be returned by
:meth:`Message.get_all <email.message.Message.get_all>`. Here's a simple
example that gets all the recipients of a message::
:meth:`Message.get_all <email.message.Message.get_all>`.

If *strict* is true, use a strict parser which rejects malformed inputs.

Here's a simple example that gets all the recipients of a message::

from email.utils import getaddresses

Expand All @@ -97,6 +105,9 @@ of the new API.
resent_ccs = msg.get_all('resent-cc', [])
all_recipients = getaddresses(tos + ccs + resent_tos + resent_ccs)

.. versionchanged:: 3.13
Add *strict* optional parameter and reject malformed inputs by default.


.. function:: parsedate(date)

Expand Down
13 changes: 13 additions & 0 deletions Doc/whatsnew/3.13.rst
Original file line number Diff line number Diff line change
Expand Up @@ -192,6 +192,19 @@ doctest
:attr:`doctest.TestResults.skipped` attributes.
(Contributed by Victor Stinner in :gh:`108794`.)

email
-----

* :func:`email.utils.getaddresses` and :func:`email.utils.parseaddr` now return
``('', '')`` 2-tuples in more situations where invalid email addresses are
encountered instead of potentially inaccurate values. Add optional *strict*
parameter to these two functions: use ``strict=False`` to get the old
behavior, accept malformed inputs.
``getattr(email.utils, 'supports_strict_parsing', False)`` can be use to
check if the *strict* paramater is available.
(Contributed by Thomas Dwyer and Victor Stinner for :gh:`102988` to improve
the CVE-2023-27043 fix.)

glob
----

Expand Down
151 changes: 142 additions & 9 deletions Lib/email/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@
specialsre = re.compile(r'[][\\()<>@,:;".]')
escapesre = re.compile(r'[\\"]')


def _has_surrogates(s):
"""Return True if s contains surrogate-escaped binary data."""
# This check is based on the fact that unless there are surrogates, utf8
Expand Down Expand Up @@ -103,12 +104,127 @@ def formataddr(pair, charset='utf-8'):
return address


def _iter_escaped_chars(addr):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can use regular expressions:

for m in re.finditer(r'(?s)\\?.', addr):
    yield (m.begin(), m.group())

Perhaps functions that use _iter_escaped_chars() can also be written using efficient regular expressions to avoid handling ever char separately. For example, _strip_quoted_realnames() could use r'(?s)(?:\\.|[^"])*|(")' or something like.

But do what looks good to you, I can optimize it later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps functions that use _iter_escaped_chars() can also be written using efficient regular expressions to avoid handling ever char separately.

This PR is based on PR #108250 which uses the following code to strip quoted real names:

    # When a comma is used in the Real Name part it is not a deliminator
    # So strip those out before counting the commas
    pattern = r'"[^"]*,[^"]*"'
    ...
    for v in fieldvalues:
        v = re.sub(pattern, '', v)
        ...

But this regular expression doesn't handle escape quote character (\"). So I wrote _strip_quoted_realnames() to replace this invalid regex.

I'm not comfortable with regex to write a parser when the grammar is that complicated. For example, antislash can also be escaped: \\. I didn't check RFC carefully. I mostly tried to support escaped quote character (\").

Also, my priority here is more to get a fix as soon as possible, since the vulnerability was reported in March, and I have to fix the CVE at work as soon as possible.

pos = 0
escape = False
for pos, ch in enumerate(addr):
if escape:
yield (pos, '\\' + ch)
escape = False
elif ch == '\\':
escape = True
else:
yield (pos, ch)
if escape:
yield (pos, '\\')


def _strip_quoted_realnames(addr):
"""Strip real names between quotes."""
if '"' not in addr:
# Fast path
return addr

start = 0
open_pos = None
result = []
for pos, ch in _iter_escaped_chars(addr):
if ch == '"':
if open_pos is None:
open_pos = pos
else:
if start != open_pos:
result.append(addr[start:open_pos])
start = pos + 1
open_pos = None

if start < len(addr):
result.append(addr[start:])

return ''.join(result)
vstinner marked this conversation as resolved.
Show resolved Hide resolved

def getaddresses(fieldvalues):
"""Return a list of (REALNAME, EMAIL) for each fieldvalue."""
all = COMMASPACE.join(str(v) for v in fieldvalues)
a = _AddressList(all)
return a.addresslist

supports_strict_parsing = True

def getaddresses(fieldvalues, *, strict=True):
"""Return a list of (REALNAME, EMAIL) or ('','') for each fieldvalue.

When parsing fails for a fieldvalue, a 2-tuple of ('', '') is returned in
its place.

If strict is true, use a strict parser which rejects malformed inputs.
"""

# If strict is true, if the resulting list of parsed addresses is greater
# than the number of fieldvalues in the input list, a parsing error has
# occurred and consequently a list containing a single empty 2-tuple [('',
# '')] is returned in its place. This is done to avoid invalid output.
#
# Malformed input: getaddresses(['alice@example.com <bob@example.com>'])
# Invalid output: [('', 'alice@example.com'), ('', 'bob@example.com')]
# Safe output: [('', '')]

if not strict:
all = COMMASPACE.join(str(v) for v in fieldvalues)
a = _AddressList(all)
return a.addresslist

fieldvalues = [str(v) for v in fieldvalues]
fieldvalues = _pre_parse_validation(fieldvalues)
addr = COMMASPACE.join(fieldvalues)
a = _AddressList(addr)
result = _post_parse_validation(a.addresslist)

# Treat output as invalid if the number of addresses is not equal to the
# expected number of addresses.
n = 0
for v in fieldvalues:
# When a comma is used in the Real Name part it is not a deliminator.
# So strip those out before counting the commas.
v = _strip_quoted_realnames(v)
# Expected number of addresses: 1 + number of commas
n += 1 + v.count(',')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be escaped commas?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser doesn't support escaped commas:

>>> email.utils.getaddresses([r'Real\, name <alice@example.org>'])
[('', 'Real\\'), ('name', 'alice@example.org')]

if len(result) != n:
return [('', '')]

return result


def _check_parenthesis(addr):
# Ignore parenthesis in quoted real names.
addr = _strip_quoted_realnames(addr)

opens = 0
for pos, ch in _iter_escaped_chars(addr):
if ch == '(':
opens += 1
elif ch == ')':
opens -= 1
if opens < 0:
return False
return (opens == 0)


def _pre_parse_validation(email_header_fields):
accepted_values = []
for v in email_header_fields:
if not _check_parenthesis(v):
v = "('', '')"
accepted_values.append(v)

return accepted_values


def _post_parse_validation(parsed_email_header_tuples):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_pre_parse_validation and _post_parse_validation can use list comprehensions or be turned into generator functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_post_parse_validation() output should be a list. There is no advantage to convert it to a generator.

If _pre_parse_validation() becomes a generator, parseaddr() has to create a list anyway for list[0]. getaddresses() cannot use a generator for _pre_parse_validation() result neither, since it consumes fieldvalues twice.

accepted_values = []
# The parser would have parsed a correctly formatted domain-literal
# The existence of an [ after parsing indicates a parsing failure
for v in parsed_email_header_tuples:
if '[' in v[1]:
v = ('', '')
accepted_values.append(v)

return accepted_values


def _format_timetuple_and_zone(timetuple, zone):
Expand Down Expand Up @@ -207,16 +323,33 @@ def parsedate_to_datetime(data):
tzinfo=datetime.timezone(datetime.timedelta(seconds=tz)))


def parseaddr(addr):
def parseaddr(addr, *, strict=True):
"""
Parse addr into its constituent realname and email address parts.

Return a tuple of realname and email address, unless the parse fails, in
which case return a 2-tuple of ('', '').

If strict is True, use a strict parser which rejects malformed inputs.
"""
addrs = _AddressList(addr).addresslist
if not addrs:
return '', ''
if not strict:
addrs = _AddressList(addr).addresslist
if not addrs:
return ('', '')
return addrs[0]

if isinstance(addr, list):
addr = addr[0]

if not isinstance(addr, str):
return ('', '')

addr = _pre_parse_validation([addr])[0]
addrs = _post_parse_validation(_AddressList(addr).addresslist)

if not addrs or len(addrs) > 1:
return ('', '')

return addrs[0]


Expand Down
Loading
Loading