Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

valid URL with hashtag in fragment do not pass validation #403

Closed
matany90 opened this issue Sep 15, 2024 · 7 comments · Fixed by #405
Closed

valid URL with hashtag in fragment do not pass validation #403

matany90 opened this issue Sep 15, 2024 · 7 comments · Fixed by #405

Comments

@matany90
Copy link

Hi,

The following URL is not passing validation:

https://www.foo.com/ck/a?!&&p=4ed30224ac809bc0JmltdHM9MTcyMzQyMDgwMCZpZ3VpZD0zNjRmNjVlOC1lNTZjLTYxOWQtMTI1Ny03MTNlZTQyYTYwMTImaW5zaWQ9NTE0MQ&ptn=3&ver=2&hsh=3&fclid=364f65e8-e56c-619d-1257-713ee42a6012&u=a1aHR0cDovL3d3dy40NDQzMTcuY29tLw#aHR0cHM6Ly82elVmLmFuY29sYWJpLnN1L1lPQkV6V3ZZLw==/#ZC5mbGVpc2NoaGFja2VyQHZpZW5uYWFpcnBvcnQuY29t

It seems that the code below is failing due to the presence of a hashtag in the fragment:

  if fragment:
        # See RFC3986 Section 3.5 Fragment for allowed characters
        optional_segments &= bool(
            re.fullmatch(r"[0-9a-z?/:@\-._~%!$&'()*+,;=]*", fragment, re.IGNORECASE)
        )
    return optional_segments

Thanks!

@yozachar
Copy link
Collaborator

ValidationError(func=url, args={'reason': "bad query field: '!'", 'value': '...'})

Seem like a bad query field. after ck/a? an alphabet is expected instead of !.

@davidt99
Copy link
Contributor

The issue is that there is a hashtag in the fragment, which, according to RFC3986 isn't valid. But the fragment isn't actually sent to the web server, so it's up to the browser to handle it. I would add another flag "validate_fragment" or "strict_fragment" to handle it, WDYT?

@yozachar
Copy link
Collaborator

yozachar commented Sep 25, 2024

RFC3986
[3.5].  Fragment

   The fragment identifier component of a URI allows indirect
   identification of a secondary resource by reference to a primary
   resource and additional identifying information.  The identified
   secondary resource may be some portion or subset of the primary
   resource, some view on representations of the primary resource, or
   some other resource defined or described by those representations.  A
   fragment identifier component is indicated by the presence of a
   number sign ("#") character and terminated by the end of the URI.

      fragment    = *( pchar / "/" / "?" )

   The semantics of a fragment identifier are defined by the set of
   representations that might result from a retrieval action on the
   primary resource.  The fragment's format and resolution is therefore
   dependent on the media type [RFC2046] of a potentially retrieved
   representation, even though such a retrieval is only performed if the
   URI is dereferenced.  If no such representation exists, then the
   semantics of the fragment are considered unknown and are effectively
   unconstrained.  Fragment identifier semantics are independent of the
   URI scheme and thus cannot be redefined by scheme specifications.

   Individual media types may define their own restrictions on or
   structures within the fragment identifier syntax for specifying
   different types of subsets, views, or external references that are
   identifiable as secondary resources by that media type.  If the
   primary resource has multiple representations, as is often the case
   for resources whose representation is selected based on attributes of
   the retrieval request (a.k.a., content negotiation), then whatever is
   identified by the fragment should be consistent across all of those
   representations.  Each representation should either define the
   fragment so that it corresponds to the same secondary resource,
   regardless of how it is represented, or should leave the fragment
   undefined (i.e., not found).

   As with any URI, use of a fragment identifier component does not
   imply that a retrieval action will take place.  A URI with a fragment
   identifier may be used to refer to the secondary resource without any
   implication that the primary resource is accessible or will ever be
   accessed.

   Fragment identifiers have a special role in information retrieval
   systems as the primary form of client-side indirect referencing,
   allowing an author to specifically identify aspects of an existing
   resource that are only indirectly provided by the resource owner.  As
   such, the fragment identifier is not used in the scheme-specific
   processing of a URI; instead, the fragment identifier is separated
   from the rest of the URI prior to a dereference, and thus the
   identifying information within the fragment itself is dereferenced
   solely by the user agent, regardless of the URI scheme.  Although
   this separate handling is often perceived to be a loss of
   information, particularly for accurate redirection of references as
   resources move over time, it also serves to prevent information
   providers from denying reference authors the right to refer to
   information within a resource selectively.  Indirect referencing also
   provides additional flexibility and extensibility to systems that use
   URIs, as new media types are easier to define and deploy than new
   schemes of identification.

   The characters slash ("/") and question mark ("?") are allowed to
   represent data within the fragment identifier.  Beware that some
   older, erroneous implementations may not handle this data correctly
   when it is used as the base URI for relative references [Section]
   [5.1].

... hashtag in the fragment, which, according to RFC3986 isn't valid ...

It does not explicitly state that # is invalid within a fragment, does it?

@yozachar
Copy link
Collaborator

yozachar commented Sep 25, 2024

Well actually, https://stackoverflow.com/questions/26088849/url-fragment-allowed-characters/26119120#26119120 explains how pchar is defined. But indeed, browsers are lenient towards URLs with multiple # signs.

validators.urls() already accepts a lot of arguments. One more flag (like strict_fragment) just to swap out a character, seems like over-engineering. Adding # directly to the fragment regex validator doesn't fail tests, though.

@davidt99
Copy link
Contributor

It makes sense, but it means we are deviating from the standard.
Another solution would be to move the regex string to a module variable, so that a user can override it in case more characters need to be added. WDYT?

@yozachar
Copy link
Collaborator

yozachar commented Sep 28, 2024

Another solution would be to move the regex string to a module variable, so that a user can override it in case more characters need to be added. WDYT?

A similar idea of modifiable validation was raised here: #396 (comment). I think it defeats the purpose of this library.

It makes sense, but it means we are deviating from the standard.

True, but aren't de-facto standards compelling in a practical sense?

@davidt99
Copy link
Contributor

davidt99 commented Oct 6, 2024

I think the library is done more than just regex when working with URLs - it parse the url correctly and do various checks.
I'll open a PR for adding #. If another character will be needed in fragment, we can rethink this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants