Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reserved characters are treated inconsistently and not sensibly preserved #16

Closed
glyph opened this issue Jun 25, 2017 · 24 comments
Closed

Comments

@glyph
Copy link
Collaborator

glyph commented Jun 25, 2017

This has been a design flaw since the inception of the library, so, mea culpa on that.

Fundamentally, preserving, escaping, and encoding "reserved" characters is entirely the URL object's job, and it's failing at that. Possibly the most succinct demonstration of the problem is this:

>>> u = URL()
>>> u = u.child(u'/')
>>> u = u.asIRI()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    u = u.asIRI()
  File
"/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py",
line 1116, in to_iri
    fragment=_percent_decode(self.fragment))
  File
"/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py",
line 861, in replace
    userinfo=_optional(userinfo, self.userinfo),
  File
"/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py",
line 606, in __init__
    for segment in path))
  File
"/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py",
line 606, in <genexpr>
    for segment in path))
  File
"/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py",
line 410, in _textcheck
    % (''.join(delims), name, value))
ValueError: one or more reserved delimiters /?# present in path segment: u'/'
>>>

This is - obviously I hope - the wrong place to be failing with an error like this.

There was previously some attempt to preserve these characters in the data model and escape them only upon stringification, but d26814c wrecked these semantics. (In fairness: the attempt to do this was broken, and there are some places, like the scheme, where certain characters indeed cannot be represented, so this direction isn't entirely wrong.)

Fundamentally if a user wants to encode slashes, question marks, hash signs or whatever else that a human might, for example, type into a text field, then it should be possible to do that.

We could fix this obvious manifestation of the problem by just putting back the escape-only-on-asText logic, but that still leaves an even more pernicious problem:

>>> u = URL(path=tuple([u'%2525']))
>>> u.asText()
u'%2525'
>>> u.asIRI().asText()
u'%25'
>>> u.asIRI().asIRI().asText()
u'%'
>>> 

Clearly, multiple trips through asIRI should not be un-escaping the escape character - the idea is that .asIRI() is a normalization step, that should be idempotent upon subsequent calls.

For the moment, I'm not sure exactly what the correct fix is here, but the property I'd really like to preserve is that for any x,

URL.fromText(URL().child(x).<as many asIRI()s or asURI()s as you want>.asText()).<as many .asIRI()s as you want, although possibly not .asURI()s>.segments[0] == x

@mahmoud
Copy link
Member

mahmoud commented Jun 25, 2017

Agreed on the invariant. It's quite late, but I have an intuition I'd like to put to paper, as it were.

asIRI has this excessive escaping behavior, but I haven't seen excessive unescaping from asURI, and I wonder if it's as simple as providing specialized decoding functions for each part, to match the specialized encoding functions (e.g., _encode_query_part). The decoding felt a bit simplistic, and I suspect this is why.

@nigelmegitt
Copy link

It isn't obvious that this was broken before, but it definitely is now (since d26814c). The library cannot know if a path segment is supposed to include relative sub-parts like outer/middle/inner or if that should be % encoded as outer%2fmiddle%2finner - it can only deal with what it is given.

The repeat encoding/decoding functionality is not a bug - it is correct implementation of the percent encoding and decoding rules. Only the calling code can know how many times it has been called. I think it is putting too much cleverness into URL to try to make it idempotent in the way this issue calls for, though I understand that it seems like an elegant solution. Instead I think if an application is calling u.asIRI().asIRI()... that is almost certainly a usage error, and if it is not a usage error, it's in the calling code's domain to know that.

Quite separately, I believe the effect of _textcheck() here, which throws if path segments include percent encoded reserved characters is a bug. There's simply no problem having percent encoded characters in a URI or IRI. That's what they're for.

Apologies in advance if I've misunderstood the intent of this issue, it's a complex area and I've probably made some mistakes understanding it all!

@mahmoud
Copy link
Member

mahmoud commented Jun 26, 2017

Hey @nigelmegitt! I think you've got the right idea as to the intent of the issue, that .asIRI be idempotent. As for _textcheck, it's totally percent-encoding agnostic, so I'm pretty sure the problem lies between there and the top level _create_client method you linked from here.

I'll delve into this particular case and see if I can diagnose the issue. Might get a good test case out of it, too :) An independent issue and stack trace might speed that along, if you wouldn't mind.

@glyph
Copy link
Collaborator Author

glyph commented Jul 1, 2017

I think that the right thing to do for now might actually be refuse to ever decode these meta-characters; i.e. URL.from_text('%25%2F').asIRI()URL.from_text('%25%2F').

Right now, .path and .query are literal text as they appear in the URI itself. This makes sense as a data structure, although it is somewhat limiting. It's also a necessary piece of structure to preserve, because the conversion to "decoded" characters is not information-preserving: since a percent-encoded sequence might represent an invalid UTF-8 sequence of octets, for example. But it also means that reserved characters can never appear in them without escaping.

The thing to do for segments that encode slashes, question marks and percent signs, I believe, would be to have a separate attribute without the exception-raising restriction that we currently have; decoded_path for a punching-bag name suggestion.

@glyph
Copy link
Collaborator Author

glyph commented Jul 1, 2017

The library cannot know if a path segment is supposed to include relative sub-parts like outer/middle/inner or if that should be % encoded as outer%2fmiddle%2finner - it can only deal with what it is given.

I just wanted to respond to this claim in particular - the library can totally know :). You just need to have attributes which unambiguously represent the data according to a particular perspective. If .path is "the segments as they appear between slashes" then it's easy to establish the invariant that the strings in it can't contain slashes.

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

I think #23 fixed the "defect" part of this, but there's still the "feature" part - we need some way to shove a "/" into a path segment and get it back out again if an application wants to do that.

@mahmoud
Copy link
Member

mahmoud commented Jul 3, 2017

I think "need" is a strong word here. I'd like to see such a need demonstrated before putting the defect at the same level as the feature. For instance, can you get werkzeug or anything else to even have a rule to match a slash in a path segment?

I'd term it an advanced use case at the very least. Maybe open a broader discussion to making the _encode_* and _decode_* functions public in a sort of hazmat way. Now that I understand those semantics, I think they're pretty stable (and way better than the anemic footgusn in the stdlib).

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

One could quibble about path segments but certainly query parameters need this treatment; they are often used by applications as a map for storing quasi-arbitrary data.

@mahmoud
Copy link
Member

mahmoud commented Jul 3, 2017

Sure, that's how a lot of PayPal's structured logging worked. Based on my experience, I'd've preferred a dedicated object for that and/or encode/decode functions with clear uses and explanations.

Another approach that occurs to me, but I'm only mentioning for brainstorming purposes, is to have a single proxy object wrapper utility that marks a value as unescaped, and have each field check for that and apply the respective encoding (query encoding for query string parts, path encoding for path parts, etc.).

u = URL()
u.replace(path=escape('a', '///', 'b'))

I don't know why, but it feels kind of Django-y. :P

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

I really don't like the idea of another top-level name, especially not the heavily fraught escape.

Along the lines of a dedicated object here, what if there were a wrapper type, like this:

u = URL()
u.replace(path=["%25"]) → URL.fromText('%25')
u.decoded.replace(path=["%25"]) → URL.fromText('%2525')
URL.fromText('%25').decoded.path[0] → '%'

The two problems I see here are:

  1. not super in love with the name .decoded
  2. it might be a little confusing that .decoded gives you a _DecodedURL or whatever, but .replace gives you back a regular URL again.

But I think I prefer this general idea, if those can be addressed / made clear somehow.

@nigelmegitt
Copy link

nigelmegitt commented Jul 3, 2017

@glyph wrote:

If .path is "the segments as they appear between slashes" then it's easy to establish the invariant that the strings in it can't contain slashes.

OK, with this approach I cannot add a sub-path to the segment as some/path/to in one step, I have to add multiple path segments in turn. Let's trace through the impacts of allowing adding such sub-paths or not allowing it:

If you allow adding a sub-path then you need to accept that sometimes it will have slashes in and sometimes it might have %-encoded slashes that are intended to be retained as such. In which case you really have to trust that the providing code was written by someone who knew what they were doing. And then when you pass that string on somewhere and it is decoded then any sub-paths with slashes in will be itemised separately. On the plus side you don't need to check the string for slashes. If they're there, it's fine.

Alternatively if you do require each segment to be appended separately then the API needs to be clear about what it is going to do if there is a reserved character in it - will it always percent encode the provided string, or will it never percent encode it but always reject an incorrectly encoded string?

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

OK, with this approach I cannot add a sub-path to the segment as some/path/to in one step, I have to add multiple path segments in turn. Let's trace through the impacts of allowing adding such sub-paths or not allowing it:

Indeed, this is the way that FilePath works, and URL intentionally mirrors it, for security (and other) reasons.

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

If you allow adding a sub-path then you need to accept that sometimes it will have slashes in and sometimes it might have %-encoded slashes that are intended to be retained as such. In which case you really have to trust that the providing code was written by someone who knew what they were doing. And then when you pass that string on somewhere and it is decoded then any sub-paths with slashes in will be itemised separately.

The approach right now is to just raise an exception, which seems like it neatly short-circuits all these issues.

If we add a .decoded object, then the answer in that case will be "always escape".

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

Alternatively if you do require each segment to be appended separately then the API needs to be clear about what it is going to do if there is a reserved character in it - will it always percent encode the provided string, or will it never percent encode it but always reject an incorrectly encoded string?

In the case of URL itself: always reject, never encode.

In the case of .decoded: always encode, never reject.

@mahmoud
Copy link
Member

mahmoud commented Jul 3, 2017

just as a point of order, navigation and appending path segments is meant to be done with .sibling() and .child(), not .replace().

Also, one may say "escape" is fraught, but looking at Python code these days, one sees a lot more encode() and decode() than escape, quote, and a lot else :P

@nigelmegitt
Copy link

OK, with this approach I cannot add a sub-path to the segment as some/path/to in one step, I have to add multiple path segments in turn. Let's trace through the impacts of allowing adding such sub-paths or not allowing it:

Indeed, this is the way that FilePath works, and URL intentionally mirrors it, for security (and other) reasons.

What about when we just create a URL from a unicode string including all the parts? Are you saying that functionality is no longer supported?

@nigelmegitt
Copy link

The approach right now is to just raise an exception, which seems like it neatly short-circuits all these issues.

Well it bumps the issues down the line by throwing an exception if an otherwise perfectly reasonably %-encoded /, ? or # is present in the segment. This didn't happen before hyperlink release 17.2.0, so I'm stuck with the previous release until I can construct a URI from a string that includes those (encoded) entities in the path. Maybe I'm missing some newly added functionality for doing this?

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

What about when we just create a URL from a unicode string including all the parts? Are you saying that functionality is no longer supported?

No, that functionality is fromText, and is irrelevant to this conversation, because the way you tell fromText about new path segments is that you put a / into the path :).

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

Well it bumps the issues down the line by throwing an exception if an otherwise perfectly reasonably %-encoded /, ? or # is present in the segment.

OK, I'm missing something then. As far as I can tell, at no point has hyperlink ever raised an exception when given e.g. URL(path=[u'%25%2F%23%3F']). In 17.2.1, that.asIRI() will raise an exception, which is definitely a bug, but that bug has been fixed in #23.

This didn't happen before hyperlink release 17.2.0, so I'm stuck with the previous release until I can construct a URI from a string that includes those (encoded) entities in the path. Maybe I'm missing some newly added functionality for doing this?

URL.fromText(u'%25%2F%23%3F') works on 17.2.1 as well, although the fact that you can't normalize it is definitely a bummer; again, that'll be fixed in the next release.

@nigelmegitt
Copy link

Here's the issue as simply as I can make it in 17.2.1:

$ pip show hyperlink
Name: hyperlink
Version: 17.2.1
Summary: A featureful, correct URL for Python.
Home-page: https://github.com/python-hyper/hyperlink
Author: Mahmoud Hashemi and Glyph Lefkowitz
Author-email: mahmoud@hatnote.com
License: BSD
Location: /Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages
Requires: 
$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from hyperlink import URL
>>> url = URL.from_text(u'ws://localhost:9006/sequence%2F%C3%BCnic%C3%B6d%C3%A9%3F%2FName/subscribe')
>>> iri = url.to_iri()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "*****/venv/lib/python2.7/site-packages/hyperlink/_url.py", line 1116, in to_iri
    fragment=_percent_decode(self.fragment))
  File "*****/venv/lib/python2.7/site-packages/hyperlink/_url.py", line 861, in replace
    userinfo=_optional(userinfo, self.userinfo),
  File "*****/venv/lib/python2.7/site-packages/hyperlink/_url.py", line 606, in __init__
    for segment in path))
  File "*****/venv/lib/python2.7/site-packages/hyperlink/_url.py", line 606, in <genexpr>
    for segment in path))
  File "*****/venv/lib/python2.7/site-packages/hyperlink/_url.py", line 410, in _textcheck
    % (''.join(delims), name, value))
ValueError: one or more reserved delimiters /?# present in path segment: u'sequence/\xfcnic\xf6d\xe9?/Name'
>>> quit()

Now here's the same code working on an older version:

$ pip install 'hyperlink<17.2.0'
Collecting hyperlink<17.2.0
  Using cached hyperlink-17.1.1-py2.py3-none-any.whl
Installing collected packages: hyperlink
  Found existing installation: hyperlink 17.2.1
    Uninstalling hyperlink-17.2.1:
      Successfully uninstalled hyperlink-17.2.1
Successfully installed hyperlink-17.1.1
EBU mc-s092695:ebu-tt-live-toolkit megitn02$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from hyperlink import URL
>>> url = URL.from_text(u'ws://localhost:9006/sequence%2F%C3%BCnic%C3%B6d%C3%A9%3F%2FName/subscribe')
>>> iri = url.to_iri()
>>> print iri.to_text()
ws://localhost:9006/sequence%2Fünicödé%3F%2FName/subscribe
>>> quit()

Apologies again if this is actually fixed by #23 but it explains my earlier comment.

@mahmoud
Copy link
Member

mahmoud commented Jul 3, 2017

Yep, fixed on trunk, don't you fret!

>>> from hyperlink import URL
>>> url = URL.from_text(u'ws://localhost:9006/sequence%2F%C3%BCnic%C3%B6d%C3%A9%3F%2FName/subscribe')
>>> iri = url.to_iri()
>>> print iri.to_text()
ws://localhost:9006/sequence%2Fünicödé%3F%2FName/subscribe

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

Here's the issue as simply as I can make it in 17.2.1

Yep, as @mahmoud said, that's exactly what #23 fixed. Sorry we were talking past each other a little bit - that what I meant by "that.asIRI() will raise an exception".

@nigelmegitt
Copy link

Great, thank you!

@mahmoud
Copy link
Member

mahmoud commented Jul 19, 2017

The release published last night (17.3.0) fixes all this behavior. Give it a spin, let me know what you think. Thanks again for the report!

@mahmoud mahmoud closed this as completed Jul 19, 2017
nigelmegitt added a commit to bbc/ebu-tt-live-toolkit that referenced this issue Nov 29, 2021
nigelmegitt added a commit to bbc/ebu-tt-live-toolkit that referenced this issue Dec 7, 2021
Pulls in upstream fixes from Release/2.1.2 (ebu#522) and fixes an issue where, on conversion to EBU-TT-D, `<br>` children of `<p>` elements that have child `<span>` elements with differing timing also get wrapped in appropriately timed `<span>`s and therefore are not active always. Fixes a presentation error for some content.

### New stuff

* Add and pass tests

Drive-by discovery and fix of a bug in `denester.py` that did not handle `span` elements with an indefinite computed end time correctly.

In `ebuttd_nested_elements.feature`, the new code bumps the span that gets style S2S6 along one, because the `br` element before it now gets wrapped in a span.

Add a BDD for timings present on both p and consecutive span elements with an intervening br element.

* Remove Pipfile and don't need pytest-catchlog

* Up version to 3.0.1

* Remove restriction on hyperlink

Appears that python-hyper/hyperlink#16 has been resolved.


### Notes from pulled in changes:

* Add service_identity module dependency

Removes a test warning, improves twisted's ability to verify certificates.

* Fix ebu#474

Use relative links to templates.
Put message payload into the UIP display so that the receiver can read it without having to open the message XML up.

* Remove requirement for pytest-capturelog, removed from

No longer needed because it has been merged into the core.

* Refactor proxy handling for client

Track a change to proxy handling in Autobahn / twisted for ebu#477. There's probably a more elegant way, but this works for now... Lacking a test, but have verified it locally.

* More elegant fix for proxy

We don't need the HTTPProxyConfig object at all anymore, just a `dict` will do.

* Don't call fixtures directly (ebu#490).

Calling fixtures directly is deprecated, this solution described at
https://docs.pytest.org/en/latest/deprecations.html#calling-fixtures-directly
seems to work, creating a named fixture rather than defining the "then"
step as a fixture directly.

* Restrict autobahn to versions before 18

Temporarily fixes ebu#496, prior to addressing the problem properly anyway.

* Restrict autobahn to versions before 18

Temporarily fixes ebu#496, prior to addressing the problem properly anyway.

* ebu#447 resequencer documentation (ebu#515)

* Fix indentation warning

* Fix underline length warning

* Fix document generation warning

* Add Resequencer Node documentation

Closes ebu#447 .

* Address review comment

Merging now because this affects documentation only, and this allows progress to continue on other pull requests.

* Handle timing types in elements without crashing (ebu#516)

* Fix ebu#492

* Define the type of `ebuttm:documentStartOfProgramme` correctly.
* When processing a timing type outside of the context of an attribute, bypass the timebase validation.

This is possibly a temporary hack, since element timing validation might be needed one day, but since we don't do anything with metadata elements, this isn't a disaster (yet).

* Add tests

Checks that including a valid time in a `ebuttm:documentStartOfProgramme` element does not cause a processing or validation error.

* ebu#507 timedelta (ebu#508)

* Correctly interpret decimal fractions of seconds

Fixes ebu#507 by interpreting decimal fractions of seconds as milliseconds and adding tests to verify this behaviour for 3 and 4 decimal places.

Removes `_int_or_none` which doesn't seem to be needed

* Specify incompatible attributes (ebu#518)

Fixes ebu#514.

* Resequencer: allow to immediately issue documents (ebu#510)

So far the resequencer starts to regularly issue documents only after
the first EBU-TT Live document has been received. However for some use
cases this might be inconvenient as an active document must exist at all
times e.g. when creating segments for an MPEG-DASH stream. The reason
for documents being issued only after the first received EBU-TT Live
document is that certain parameters of that document are used for
initialisation.

This commit adds a new configuration parameter that specifies a document
which will be used for initialisation (instead of the first received
EBU-TT Live document). Therefore the resequencer will immediately (be
able to) start issuing documents after its creation.

Closes ebu#505.

* Resequencer: fix sequence number 1 if no content (ebu#509)

While the EBUTT3Splicer is provided with the current sequence number
stored/incremented by the Resequencer, the `create_compatible_document`
method so far refers to the sequence number stored internally by
EBUTT3DocumentSequence instead. This leads to a sequence number of 1 in
the issued output document if no subtitle content is available.

To align with the correct (strictly monotonic increasing) sequence
number, always use the one stored by the Resequencer.

Fixes ebu#502.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants