reserved characters are treated inconsistently and not sensibly preserved #16
Hopefully this is temporary, just until python-hyper/hyperlink#16 is resolved.
Agreed on the invariant. It's quite late, but I have an intuition I'd like to put to paper, as it were.
It isn't obvious that this was broken before, but it definitely is now (since d26814c). The library cannot know if a path segment is supposed to include relative sub-parts like … The repeat encoding/decoding functionality is not a bug: it is a correct implementation of the percent encoding and decoding rules. Only the calling code can know how many times it has been called. I think it is putting too much cleverness into … Quite separately, I believe the effect of … Apologies in advance if I've misunderstood the intent of this issue; it's a complex area and I've probably made some mistakes understanding it all!
Hey @nigelmegitt! I think you've got the right idea as to the intent of the issue. I'll delve into this particular case and see if I can diagnose the issue; might get a good test case out of it, too :) An independent issue and stack trace might speed that along, if you wouldn't mind.
I think that the right thing to do for now might actually be to refuse to ever decode these meta-characters; i.e. … Right now, … The thing to do for segments that encode slashes, question marks and percent signs, I believe, would be to have a separate attribute without the exception-raising restriction that we currently have; …
I just wanted to respond to this claim in particular - the library can totally know :). You just need to have attributes which unambiguously represent the data according to a particular perspective. If …
I think #23 fixed the "defect" part of this, but there's still the "feature" part - we need some way to shove a "/" into a path segment and get it back out again if an application wants to do that.
I think "need" is a strong word here. I'd like to see such a need demonstrated before putting the defect at the same level as the feature. For instance, can you get werkzeug or anything else to even have a rule to match a slash in a path segment? I'd term it an advanced use case at the very least. Maybe open a broader discussion to making the |
One could quibble about path segments but certainly query parameters need this treatment; they are often used by applications as a map for storing quasi-arbitrary data.
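(A generic standard-library illustration, not hyperlink-specific, of why a query value needs this treatment: a value containing "&" or "=" only survives the round trip because it is percent-encoded. Python 2 names are used to match the sessions later in this thread.)

```
>>> from urllib import urlencode
>>> from urlparse import parse_qs
>>> urlencode({'filter': 'price>100&category=books'})
'filter=price%3E100%26category%3Dbooks'
>>> parse_qs('filter=price%3E100%26category%3Dbooks')
{'filter': ['price>100&category=books']}
```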
Sure, that's how a lot of PayPal's structured logging worked. Based on my experience, I'd've preferred a dedicated object for that and/or encode/decode functions with clear uses and explanations. Another approach that occurs to me, but which I'm only mentioning for brainstorming purposes, is to have a single proxy object wrapper utility that marks a value as unescaped, and have each field check for that and apply the respective encoding (query encoding for query string parts, path encoding for path parts, etc.):

```python
u = URL()
u.replace(path=escape('a', '///', 'b'))
```

I don't know why, but it feels kind of Django-y. :P
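(A minimal sketch of that "mark a value as unescaped" proxy idea, purely as brainstorming: `escape`, `_Unescaped`, and `_coerce_path` are hypothetical helpers, not hyperlink API. Python 2's urllib is used to match the sessions in this thread.)

```python
from urllib import quote  # Python 2; urllib.parse.quote on Python 3

class _Unescaped(object):
    """Proxy wrapper marking values as raw, not-yet-encoded data."""
    def __init__(self, *parts):
        self.parts = parts

def escape(*parts):
    # The helper used in the example above: escape('a', '///', 'b')
    return _Unescaped(*parts)

def _coerce_path(value):
    # Each field would check for the marker and apply its own encoding rules;
    # a path field percent-encodes reserved characters (including "/") inside
    # each segment, so slashes survive as data rather than as separators.
    if isinstance(value, _Unescaped):
        return [quote(p.encode('utf-8'), safe='') for p in value.parts]
    return list(value)

# _coerce_path(escape('a', '///', 'b'))  ->  ['a', '%2F%2F%2F', 'b']
```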
I really don't like the idea of another top-level name, especially not the heavily fraught escape. Along the lines of a dedicated object here, what if there were a wrapper type, like this:

```python
u = URL()
u.replace(path=["%25"])               # → URL.fromText('%25')
u.decoded.replace(path=["%25"])       # → URL.fromText('%2525')
URL.fromText('%25').decoded.path[0]   # → '%'
```

The two problems I see here are: …

But I think I prefer this general idea, if those can be addressed / made clear somehow.
@glyph wrote: …

OK, with this approach I cannot add a sub-path to the segment as … If you allow adding a sub-path then you need to accept that sometimes it will have slashes in it and sometimes it might have %-encoded slashes that are intended to be retained as such, in which case you really have to trust that the providing code was written by someone who knew what they were doing. And then when you pass that string on somewhere and it is decoded, any sub-paths with slashes in them will be itemised separately. On the plus side, you don't need to check the string for slashes; if they're there, it's fine. Alternatively, if you do require each segment to be appended separately, then the API needs to be clear about what it is going to do if there is a reserved character in it: will it always percent-encode the provided string, or will it never percent-encode it but always reject an incorrectly encoded string?
Indeed, this is the way that …
The approach right now is to just raise an exception, which seems like it neatly short-circuits all these issues. If we add a …
In the case of … In the case of …
Just as a point of order, navigation and appending path segments are meant to be done with .sibling() and .child(), not .replace(). Also, one may say "escape" is fraught, but looking at Python code these days, one sees a lot more …
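(A brief, illustrative sketch of the .child()/.sibling() style referred to above; the host and segments are made-up examples.)

```
>>> from hyperlink import URL
>>> base = URL.from_text(u'http://example.com/api/v1')
>>> base.child(u'users', u'42').to_text()    # append path segments
u'http://example.com/api/v1/users/42'
>>> base.sibling(u'v2').to_text()            # replace the last segment
u'http://example.com/api/v2'
```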
What about when we just create a URL from a unicode string including all the parts? Are you saying that functionality is no longer supported?
Well, it bumps the issues down the line by throwing an exception if an otherwise perfectly reasonably %-encoded …
No, that functionality is …
OK, I'm missing something then. As far as I can tell, at no point has hyperlink ever raised an exception when given e.g. …
Here's the issue as simply as I can make it in 17.2.1:

```
$ pip show hyperlink
Name: hyperlink
Version: 17.2.1
Summary: A featureful, correct URL for Python.
Home-page: https://github.com/python-hyper/hyperlink
Author: Mahmoud Hashemi and Glyph Lefkowitz
Author-email: mahmoud@hatnote.com
License: BSD
Location: /Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages
Requires:
$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from hyperlink import URL
>>> url = URL.from_text(u'ws://localhost:9006/sequence%2F%C3%BCnic%C3%B6d%C3%A9%3F%2FName/subscribe')
>>> iri = url.to_iri()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "*****/venv/lib/python2.7/site-packages/hyperlink/_url.py", line 1116, in to_iri
fragment=_percent_decode(self.fragment))
File "*****/venv/lib/python2.7/site-packages/hyperlink/_url.py", line 861, in replace
userinfo=_optional(userinfo, self.userinfo),
File "*****/venv/lib/python2.7/site-packages/hyperlink/_url.py", line 606, in __init__
for segment in path))
File "*****/venv/lib/python2.7/site-packages/hyperlink/_url.py", line 606, in <genexpr>
for segment in path))
File "*****/venv/lib/python2.7/site-packages/hyperlink/_url.py", line 410, in _textcheck
% (''.join(delims), name, value))
ValueError: one or more reserved delimiters /?# present in path segment: u'sequence/\xfcnic\xf6d\xe9?/Name'
>>> quit()
```

Now here's the same code working on an older version: …
Apologies again if this is actually fixed by #23, but it explains my earlier comment.
Yep, fixed on trunk, don't you fret!

```
>>> from hyperlink import URL
>>> url = URL.from_text(u'ws://localhost:9006/sequence%2F%C3%BCnic%C3%B6d%C3%A9%3F%2FName/subscribe')
>>> iri = url.to_iri()
>>> print iri.to_text()
ws://localhost:9006/sequence%2Fünicödé%3F%2FName/subscribe
```
Great, thank you!
The release published last night (17.3.0) fixes all this behavior. Give it a spin and let me know what you think. Thanks again for the report!
Appears that python-hyper/hyperlink#16 has been resolved.
This has been a design flaw since the inception of the library, so, mea culpa on that.
Fundamentally, preserving, escaping, and encoding "reserved" characters is entirely the URL object's job, and it's failing at that. Possibly the most succinct demonstration of the problem is this:
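(A stand-in reconstruction, not the original snippet: a minimal reproduction consistent with the traceback shown later in this thread, using a made-up example URL.)

```
>>> from hyperlink import URL
>>> # A path segment containing an encoded slash parses fine as text,
>>> # but normalizing to an IRI decodes the "%2F" and then rejects the result:
>>> URL.from_text(u'http://example.com/a%2Fb').to_iri()
Traceback (most recent call last):
  ...
ValueError: one or more reserved delimiters /?# present in path segment: u'a/b'
```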
This is - obviously I hope - the wrong place to be failing with an error like this.
There was previously some attempt to preserve these characters in the data model and escape them only upon stringification, but d26814c wrecked these semantics. (In fairness: the attempt to do this was broken, and there are some places, like the scheme, where certain characters indeed cannot be represented, so this direction isn't entirely wrong.)
Fundamentally, if a user wants to encode slashes, question marks, hash signs, or whatever else a human might, for example, type into a text field, then it should be possible to do that.
We could fix this obvious manifestation of the problem by just putting back the escape-only-on-asText logic, but that still leaves an even more pernicious problem:
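(A stand-in illustration, not the original snippet: assuming the pre-fix decoding behavior and a made-up example URL, each normalization pass decodes the escape character again.)

```
>>> from hyperlink import URL
>>> u = URL.from_text(u'http://example.com/%2525')  # one path segment: an escaped "%25"
>>> u.to_iri().path[0]
u'%25'
>>> u.to_iri().to_iri().path[0]  # a second normalization pass un-escapes again
u'%'
```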
Clearly, multiple trips through asIRI should not be un-escaping the escape character; the idea is that .asIRI() is a normalization step that should be idempotent upon subsequent calls.

For the moment, I'm not sure exactly what the correct fix is here, but the property I'd really like to preserve is that, for any x:

URL.fromText(URL().child(x).<as many asIRI()s or asURI()s as you want>.asText()).<as many .asIRI()s as you want, although possibly not .asURI()s>.segments[0] == x
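(The same property restated concretely, using hyperlink's snake_case names and .path in place of .segments; this is illustrative only, and whether it actually holds depends on the hyperlink version in use.)

```python
from hyperlink import URL

def segment_roundtrips(x):
    # The desired invariant: put x in as a single path segment, normalize as
    # many times as you like, serialize, re-parse, normalize again, and x
    # should come back out unchanged.
    u = URL().child(x).to_iri().to_uri()             # any number of to_iri()/to_uri() calls
    reparsed = URL.from_text(u.to_text())
    return reparsed.to_iri().to_iri().path[0] == x   # any number of to_iri() calls
```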