Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: using OpenSUSE and Fedora packages which change the Bleach code, parse_shim tests fail with Python 3.10.12 #707

Closed
MeggyCal opened this issue Jun 29, 2023 · 6 comments
Labels
untriaged Bug reports that haven't been triaged

Comments

@MeggyCal
Copy link

From Python 310 WhatsNew:

gh-102153: urllib.parse.urlsplit() now strips leading C0 control and space characters following the specification for URLs defined by WHATWG in response to CVE-2023-24329. Patch by Illia Volochii.

therefore some tests fail:

[   15s] =================================== FAILURES ===================================
[   15s] _________________ test_urlparse[\t   :foo.com   \n-expected8] __________________
[   15s] 
[   15s] uri = '\t   :foo.com   \n'
[   15s] expected = ParseResult(scheme='', netloc='', path='   :foo.com   ', params='', query='', fragment='')
[   15s] 
[   15s]     @pytest.mark.parametrize(
[   15s]         "uri, expected",
[   15s]         [
[   15s]             ("", ParseResult()),
[   15s]             ("http://example\t.\norg", ParseResult(scheme="http", netloc="example.org")),
[   15s]             (
[   15s]                 "http://user:pass@foo:21/bar;par?b#c",
[   15s]                 ParseResult(
[   15s]                     scheme="http",
[   15s]                     netloc="user:pass@foo:21",
[   15s]                     path="/bar",
[   15s]                     params="par",
[   15s]                     query="b",
[   15s]                     fragment="c",
[   15s]                 ),
[   15s]             ),
[   15s]             ("https://test:@test", ParseResult(scheme="https", netloc="test:@test")),
[   15s]             ("https://:@test", ParseResult(scheme="https", netloc=":@test")),
[   15s]             (
[   15s]                 "non-special://test:@test/x",
[   15s]                 ParseResult(scheme="non-special", netloc="test:@test", path="/x"),
[   15s]             ),
[   15s]             (
[   15s]                 "non-special://:@test/x",
[   15s]                 ParseResult(scheme="non-special", netloc=":@test", path="/x"),
[   15s]             ),
[   15s]             ("http:foo.com", ParseResult(scheme="http", path="foo.com")),
[   15s]             # NOTE(willkg): The wpt tests set the scheme to http becaue that's what
[   15s]             # the base url is. Since our parser is not using a baseurl, it sets the
[   15s]             # scheme to "". Further, our parser includes spaces at the beginning,
[   15s]             # but I don't see that as being problematic.
[   15s]             ("\t   :foo.com   \n", ParseResult(path="   :foo.com   ")),
[   15s]             # NOTE(willkg): The wpt tests set the path to "/foo/foo.com" because
[   15s]             # the base url is at "/foo"
[   15s]             (" foo.com  ", ParseResult(path=" foo.com  ")),
[   15s]             ("a:\t foo.com", ParseResult(scheme="a", path=" foo.com")),
[   15s]             (
[   15s]                 "http://f:21/ b ? d # e ",
[   15s]                 ParseResult(
[   15s]                     scheme="http", netloc="f:21", path="/ b ", query=" d ", fragment=" e "
[   15s]                 ),
[   15s]             ),
[   15s]             (
[   15s]                 "lolscheme:x x#x x",
[   15s]                 ParseResult(scheme="lolscheme", path="x x", fragment="x x"),
[   15s]             ),
[   15s]             ("http://f:/c", ParseResult(scheme="http", netloc="f:", path="/c")),
[   15s]             ("http://f:0/c", ParseResult(scheme="http", netloc="f:0", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests normalize the 0000000000000 to 0 so the
[   15s]             # netloc should be "f:0".
[   15s]             (
[   15s]                 "http://f:00000000000000/c",
[   15s]                 ParseResult(scheme="http", netloc="f:00000000000000", path="/c"),
[   15s]             ),
[   15s]             # NOTE(willkg): The wpt tests drop the 0000000000000000000 altogether
[   15s]             # so the netloc should be "f".
[   15s]             (
[   15s]                 "http://f:00000000000000000000080/c",
[   15s]                 ParseResult(scheme="http", netloc="f:00000000000000000000080", path="/c"),
[   15s]             ),
[   15s]             # This is an invalid ipv6 url
[   15s]             ("http://2001::1]", ValueError),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f:b/c", ParseResult(scheme="http", netloc="f:b", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f: /c", ParseResult(scheme="http", netloc="f: ", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f:999999/c", ParseResult(scheme="http", netloc="f:999999", path="/c")),
[   15s]         ],
[   15s]     )
[   15s]     def test_urlparse(uri, expected):
[   15s]     
[   15s]         if inspect.isclass(expected) and issubclass(expected, BaseException):
[   15s]             with pytest.raises(expected):
[   15s]                 urlparse(uri)
[   15s]     
[   15s]         else:
[   15s]             parsed = urlparse(uri)
[   15s]             print(parsed)
[   15s]             assert parsed.scheme == expected.scheme
[   15s]             assert parsed.netloc == expected.netloc
[   15s] >           assert parsed.path == expected.path
[   15s] E           AssertionError: assert ':foo.com   ' == '   :foo.com   '
[   15s] E             -    :foo.com   
[   15s] E             ? ---
[   15s] E             + :foo.com
[   15s] 
[   15s] tests/test_parse_shim.py:108: AssertionError
[   15s] ----------------------------- Captured stdout call -----------------------------
[   15s] ParseResult(scheme='', netloc='', path=':foo.com   ', params='', query='', fragment='')
[   15s] _____________________ test_urlparse[ foo.com  -expected9] ______________________
[   15s] 
[   15s] uri = ' foo.com  '
[   15s] expected = ParseResult(scheme='', netloc='', path=' foo.com  ', params='', query='', fragment='')
[   15s] 
[   15s]     @pytest.mark.parametrize(
[   15s]         "uri, expected",
[   15s]         [
[   15s]             ("", ParseResult()),
[   15s]             ("http://example\t.\norg", ParseResult(scheme="http", netloc="example.org")),
[   15s]             (
[   15s]                 "http://user:pass@foo:21/bar;par?b#c",
[   15s]                 ParseResult(
[   15s]                     scheme="http",
[   15s]                     netloc="user:pass@foo:21",
[   15s]                     path="/bar",
[   15s]                     params="par",
[   15s]                     query="b",
[   15s]                     fragment="c",
[   15s]                 ),
[   15s]             ),
[   15s]             ("https://test:@test", ParseResult(scheme="https", netloc="test:@test")),
[   15s]             ("https://:@test", ParseResult(scheme="https", netloc=":@test")),
[   15s]             (
[   15s]                 "non-special://test:@test/x",
[   15s]                 ParseResult(scheme="non-special", netloc="test:@test", path="/x"),
[   15s]             ),
[   15s]             (
[   15s]                 "non-special://:@test/x",
[   15s]                 ParseResult(scheme="non-special", netloc=":@test", path="/x"),
[   15s]             ),
[   15s]             ("http:foo.com", ParseResult(scheme="http", path="foo.com")),
[   15s]             # NOTE(willkg): The wpt tests set the scheme to http becaue that's what
[   15s]             # the base url is. Since our parser is not using a baseurl, it sets the
[   15s]             # scheme to "". Further, our parser includes spaces at the beginning,
[   15s]             # but I don't see that as being problematic.
[   15s]             ("\t   :foo.com   \n", ParseResult(path="   :foo.com   ")),
[   15s]             # NOTE(willkg): The wpt tests set the path to "/foo/foo.com" because
[   15s]             # the base url is at "/foo"
[   15s]             (" foo.com  ", ParseResult(path=" foo.com  ")),
[   15s]             ("a:\t foo.com", ParseResult(scheme="a", path=" foo.com")),
[   15s]             (
[   15s]                 "http://f:21/ b ? d # e ",
[   15s]                 ParseResult(
[   15s]                     scheme="http", netloc="f:21", path="/ b ", query=" d ", fragment=" e "
[   15s]                 ),
[   15s]             ),
[   15s]             (
[   15s]                 "lolscheme:x x#x x",
[   15s]                 ParseResult(scheme="lolscheme", path="x x", fragment="x x"),
[   15s]             ),
[   15s]             ("http://f:/c", ParseResult(scheme="http", netloc="f:", path="/c")),
[   15s]             ("http://f:0/c", ParseResult(scheme="http", netloc="f:0", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests normalize the 0000000000000 to 0 so the
[   15s]             # netloc should be "f:0".
[   15s]             (
[   15s]                 "http://f:00000000000000/c",
[   15s]                 ParseResult(scheme="http", netloc="f:00000000000000", path="/c"),
[   15s]             ),
[   15s]             # NOTE(willkg): The wpt tests drop the 0000000000000000000 altogether
[   15s]             # so the netloc should be "f".
[   15s]             (
[   15s]                 "http://f:00000000000000000000080/c",
[   15s]                 ParseResult(scheme="http", netloc="f:00000000000000000000080", path="/c"),
[   15s]             ),
[   15s]             # This is an invalid ipv6 url
[   15s]             ("http://2001::1]", ValueError),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f:b/c", ParseResult(scheme="http", netloc="f:b", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f: /c", ParseResult(scheme="http", netloc="f: ", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f:999999/c", ParseResult(scheme="http", netloc="f:999999", path="/c")),
[   15s]         ],
[   15s]     )
[   15s]     def test_urlparse(uri, expected):
[   15s]     
[   15s]         if inspect.isclass(expected) and issubclass(expected, BaseException):
[   15s]             with pytest.raises(expected):
[   15s]                 urlparse(uri)
[   15s]     
[   15s]         else:
[   15s]             parsed = urlparse(uri)
[   15s]             print(parsed)
[   15s]             assert parsed.scheme == expected.scheme
[   15s]             assert parsed.netloc == expected.netloc
[   15s] >           assert parsed.path == expected.path
[   15s] E           AssertionError: assert 'foo.com  ' == ' foo.com  '
[   15s] E             -  foo.com  
[   15s] E             ? -
[   15s] E             + foo.com
[   15s] 
[   15s] tests/test_parse_shim.py:108: AssertionError
[   15s] ----------------------------- Captured stdout call -----------------------------
[   15s] ParseResult(scheme='', netloc='', path='foo.com  ', params='', query='', fragment='')

I already made a PR: #706, but copying it here so it could be discussed.

@MeggyCal MeggyCal added the untriaged Bug reports that haven't been triaged label Jun 29, 2023
@befeleme
Copy link

befeleme commented Jul 3, 2023

This issue is not specific to Python 3.10, the mentioned CVE fix was included in all Pythons since Python 3.7.
bleach vendors the Python parse.py from unsupported Python 3.6.14. Does it still need to?

@willkg
Copy link
Member

willkg commented Jul 3, 2023

First off, if you think there's a security vulnerability in Bleach, please use our security vulnerability disclosure process.

This issue is not specific to Python 3.10, the mentioned CVE fix was included in all Pythons since Python 3.7. bleach vendors the Python parse.py from unsupported Python 3.6.14. Does it still need to?

I believe CVE-2023-24329 doesn't apply to Bleach because Bleach already drops space and control characters before running the url through urlparse:

bleach/bleach/sanitizer.py

Lines 475 to 507 in 13d6c0c

def sanitize_uri_value(self, value, allowed_protocols):
"""Checks a uri value to see if it's allowed
:arg value: the uri value to sanitize
:arg allowed_protocols: list of allowed protocols
:returns: allowed value or None
"""
# NOTE(willkg): This transforms the value into a normalized one that's
# easier to match and verify, but shouldn't get returned since it's
# vastly different than the original value.
# Convert all character entities in the value
normalized_uri = html5lib_shim.convert_entities(value)
# Nix backtick, space characters, and control characters
normalized_uri = re.sub(r"[`\000-\040\177-\240\s]+", "", normalized_uri)
# Remove REPLACEMENT characters
normalized_uri = normalized_uri.replace("\ufffd", "")
# Lowercase it--this breaks the value, but makes it easier to match
# against
normalized_uri = normalized_uri.lower()
try:
# Drop attributes with uri values that have protocols that aren't
# allowed
parsed = parse_shim.urlparse(normalized_uri)
except ValueError:
# URI is impossible to parse, therefore it's not allowed
return None

Bleach vendors parse.py from Python 3.6.14 to solve the problem where urlparse changed behavior in Python 3.9. There's some discussion in #565 .

It's possible we could update the parse.py code to a more recent version that meets needs and doesn't introduce new issues. It's also possible we could re-evaluate the original problem and find a new solution that vendors something different or takes another approach.

@willkg
Copy link
Member

willkg commented Jul 3, 2023

I installed Python 3.10.12 and I can't reproduce this issue--the tests all pass for me. Also, I'm a little surprised this is a problem since Bleach vendors it's parse implementation and the tests that are failing are testing the vendored parse implementation which didn't change.

@MeggyCal Can you give me steps to reproduce the test failures you're seeing?

@befeleme
Copy link

befeleme commented Jul 3, 2023

I can't speak for @MeggyCal, but in Fedora, we use the Python's standard library urllib.parse instead of the vendored one. This is generally considered a desired approach which reduces the size of the resulting packages and lowers the maintainers burden (eg. in case of need to deliver a patch to all the versions of a library in the distribution).

Since test_parse_shim.py only tests the urlparse behavior and not bleach' sanitizer, the expected outputs contain the leading whitespace characters. This behavior no longer appears in urlparse in any Python version newer than 3.6, hence the suggestion that the vendored version could be updated.

@MeggyCal
Copy link
Author

MeggyCal commented Jul 4, 2023

We take similar approach in (open)SUSE like the one described in #707 (comment). Thanks, @befeleme !

@willkg willkg changed the title bug: test failure with Python 3.10.12 bug: using OpenSUSE and Fedora packages which change the Bleach code, parse_shim tests fail with Python 3.10.12 Jul 5, 2023
@willkg
Copy link
Member

willkg commented Jul 5, 2023

Given that, I'm inferring the steps to reproduce are something like this:

  1. Python 3.10.12, Bleach v6.0.0 as packaged in OpenSUSE or Fedora
  2. run the tests somehow
  3. get test failures in test_parse_shim which tests that the vendored parse module behaves as Bleach needs it to

Given that, this is not an upstream problem. We explicitly vendored parse for specific reasons and now the tests fail for those specific packages. That's outside the scope of my responsibility and it's not something I'm going to support.

You should file bugs with the relevant packages to patch the tests.

Thank you!

@willkg willkg closed this as completed Jul 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
untriaged Bug reports that haven't been triaged
Projects
None yet
Development

No branches or pull requests

3 participants