reserved characters are treated inconsistently and not sensibly preserved #16

Closed
@glyph

This has been a design flaw since the inception of the library, so, mea culpa on that.

Fundamentally, preserving, escaping, and encoding "reserved" characters is entirely the URL object's job, and it's failing at that. Possibly the most succinct demonstration of the problem is this:

>>> from hyperlink import URL
>>> u = URL()
>>> u = u.child(u'/')
>>> u = u.asIRI()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    u = u.asIRI()
  File "/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py", line 1116, in to_iri
    fragment=_percent_decode(self.fragment))
  File "/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py", line 861, in replace
    userinfo=_optional(userinfo, self.userinfo),
  File "/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py", line 606, in __init__
    for segment in path))
  File "/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py", line 606, in <genexpr>
    for segment in path))
  File "/Users/glyph/Library/Python/2.7/lib/python/site-packages/hyperlink/_url.py", line 410, in _textcheck
    % (''.join(delims), name, value))
ValueError: one or more reserved delimiters /?# present in path segment: u'/'
>>>

This is, obviously I hope, the wrong place to fail with an error like this.

There was previously some attempt to preserve these characters in the data model and escape them only upon stringification, but d26814c wrecked these semantics. (In fairness: the attempt to do this was broken, and there are some places, like the scheme, where certain characters indeed cannot be represented, so this direction isn't entirely wrong.)

Fundamentally, if a user wants to encode slashes, question marks, hash signs, or whatever else a human might type into a text field, for example, then it should be possible to do that.

We could fix this obvious manifestation of the problem by just putting back the escape-only-on-asText logic, but that still leaves an even more pernicious problem:

>>> u = URL(path=tuple([u'%2525']))
>>> u.asText()
u'%2525'
>>> u.asIRI().asText()
u'%25'
>>> u.asIRI().asIRI().asText()
u'%'
>>> 

Clearly, multiple trips through asIRI should not un-escape the escape character itself: the idea is that .asIRI() is a normalization step that should be idempotent across subsequent calls.
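To make the idempotence requirement concrete, here is a minimal sketch using only the standard library, not hyperlink's own code. It first shows that a blanket percent-decode (`urllib.parse.unquote`) loses a layer on every pass, then shows one hypothetical alternative, `normalize_once`, which decodes only escapes for unreserved ASCII characters and leaves `%25` and reserved delimiters encoded. The function name and the choice to leave non-ASCII escapes untouched are assumptions made for illustration only.

```python
import re
import string
from urllib.parse import unquote

# RFC 3986 "unreserved" characters: always safe to show decoded.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Matches a '%' optionally followed by two hex digits.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})?")


def normalize_once(text):
    """Hypothetical normalization pass: decode only unreserved escapes."""
    def fix(match):
        hex_pair = match.group(1)
        if hex_pair is None:
            # A bare '%' is not a valid escape; encode it so later
            # passes cannot mistake it for the start of one.
            return "%25"
        char = chr(int(hex_pair, 16))
        # Keep '%25', '%2F', '%3F', '%23', etc. encoded.
        return char if char in UNRESERVED else match.group(0)
    return _ESCAPE.sub(fix, text)


# Blanket decoding strips one layer per call, as in the session above.
once = unquote("%2525")
twice = unquote(once)

# The selective pass reaches a fixed point after one call.
first = normalize_once("%41%2F%2525")
second = normalize_once(first)
```

Under these assumptions `first` and `second` are equal, which is the property `.asIRI()` is missing. Whether the library should decode exactly this set of characters is a separate design question.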

For the moment, I'm not sure exactly what the correct fix is here, but the property I'd really like to preserve is that for any x,

URL.fromText(URL().child(x).<as many asIRI()s or asURI()s as you want>.asText()).<as many .asIRI()s as you want, although possibly not .asURI()s>.segments[0] == x
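The property above can be sketched as an executable check. The following is a stand-in built on `urllib.parse.quote` and `unquote`, not hyperlink's API: `segments_to_text`, `text_to_segments`, and `round_trips` are hypothetical helpers that keep segments decoded in the data model, escape everything only when producing text, and decode exactly once when parsing.

```python
from urllib.parse import quote, unquote


def segments_to_text(segments):
    # Escape on stringification only; safe="" means '/' is escaped too.
    return "/".join(quote(segment, safe="") for segment in segments)


def text_to_segments(text):
    # Split on literal slashes first, then decode each segment once.
    return [unquote(part) for part in text.split("/")]


def round_trips(x, extra_passes=3):
    """Check that x survives serialization and repeated re-normalization."""
    text = segments_to_text([x])
    for _ in range(extra_passes):
        # Parse and re-serialize, standing in for repeated asIRI()/asURI().
        text = segments_to_text(text_to_segments(text))
    return text_to_segments(text)[0] == x


samples = ["/", "?", "#", "%", "%2525", "a b", "\u00e9"]
results = {x: round_trips(x) for x in samples}
```

In this sketch every sample round-trips because the text form is the only place escapes exist, so there is nothing for a normalization step to decode twice.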
