Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix #11446: linkcheck should split url only if # part of anchor #11447

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

rsokolewicz
Copy link

fixes #11446

as described in the issue, the link checker incorrectly processes URLs that have a # that is part of the directory path, rather than designating the anchor.

This pull request fixes this by splitting the URL using a regexp (.+/[^/]*)#([^/]*?/?$), which is a complicated way of saying we want to split the URL by # only if it is not followed by more forward slashes.

I've tested the implementation by hand on the URL https://microsoft.github.io/pyright/#/type-concepts-advanced as described in the issue and it works. So that nothing breaks in the future, I added a test that ensures that the following two URLs:

  • "https://www.example.com/path/to/resource#123"
    
  • "https://www.example.com/path/to/#/resource#123"
    

the anchor "123" is split off correctly. Although the test works, it's quite technical with the mockers that I used. If you guys have suggestions on how to improve, I'm all ears.

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments.

anchor = None
break
uri_pattern = r"(.+/[^/]*)#([^/]*?/?$)"
match = re.search(uri_pattern, uri)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For efficiency reason, I would suggest declaring the pattern in the global scope (or the class scope) and compiling it.

req_url, anchor = uri, None

for rex in self.anchors_ignore:
if anchor and rex.match(anchor):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anchor can be checked outside of the loop:

if anchor:
    for rex in self.anchors_ignore:
        if rex.match(anchor):
            anchor = None
            break

@@ -728,3 +733,44 @@ def test_linkcheck_exclude_documents(app):
'info': 'br0ken_link matched br[0-9]ken_link from linkcheck_exclude_documents',
},
]


urls = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would advise not to use a global declaration here and simply write

@pytest.mark.parametrize("test_url", [
    "...."
])

Also, maybe rename "123" to "anchor" so that we know that it is an anchor.

@jayaddison
Copy link
Contributor

I'm not a regular contributor here, so please take my feedback with a grain of salt. My concern about this change isn't related to the code, but to standards (in particular RFC3986).

I don't think that handling the URL fragments this way would match the RFC3986 specification, and I think (although could be wrong) that the rfc3986 Python library demonstrates that:

>>> import rfc3986
>>> rfc3986.urlparse("https://www.example.com/path/to/#/resource#123")
ParseResult(scheme='https', userinfo=None, host='www.example.com', port=None, path='/path/to/', query=None, fragment='/resource#123')

If we think that's a bug with rfc3986 perhaps we could report it there - my guess is that pyright is using some a nonstandard URL scheme here however.

Supporting that kind of thing is often fine short-term to solve an individual case, but over a longer duration of time it can lead to standards fragmentation and resulting maintenance burden, confusion and compatibility problems.

I'll try to think of ways that we could address this and also handle the use case -- and perhaps that'll involve speaking to the pyright documentation maintainers -- but I'd like to suggest proceeding cautiously with any code changes.

(again, my contributions and involvement here are fairly minimal, so feel free to ignore all of the above)

@picnixz
Copy link
Member

picnixz commented Jun 2, 2023

@jayaddison pyright is not doing anything "wrong" per se. The use of /#/ is for SPA (single-page applications) that allows you to navigate between different pages without restarting the whole app and this only for the client framework router (for instance Angular's HashLocationStrategy vs PathLocationStrategy).

While I commented the PR without considering RFC3986, I agree that it might be better to enforce RFC3986 by default. We could, however, hard-detect whether /#/ is in the URI or not instead of using a regular expression, and then, split the path + query according to RFC.

@jayaddison
Copy link
Contributor

Yep, I've seen a few single-page applications use similar techniques, and since they can inspect the current URL from within their JavaScript code, they can implement patterns like this.

We don't have a JavaScript runtime within Sphinx, though, and it wouldn't seem great to follow a path of trying to implement edge case handling for all possible application configurations (that's the path I think that Sphinx should avoid, while considering alternative routes to achieve everyone's desired goals. I have a vague memory that there have been attempts to specify this in the past to make fragment-URLs crawlable by search engines).

@picnixz
Copy link
Member

picnixz commented Jun 2, 2023

What about hard-detecting /#/ then (instead of a regex as done by this PR?)

@jayaddison
Copy link
Contributor

Using /#/ as a special case would introduce the same problem: it'd essentially define an ad-hoc approach for how to handle some subset of URLs. I wouldn't be surprised if soon after (or, equally problematically, months or years after), we'd encounter requests about sites that use /#/ with a different anchor scheme - we end up adding further specialisations and customisations, often with conflicting objectives.

Where possible I think it's better to determine the objective, discuss that widely, gather feedback on ideas and problems, perhaps prototype an approach, and codify a standard (and then vote and agree upon it). It may seem like the standardisation process can take a long time, but I think that in the long term it's a more respectful approach and saves time (because less work is wasted, and systems become interoperable).

@jayaddison
Copy link
Contributor

we end up adding further specialisations and customisations, often with conflicting objectives.

(or we end up coming back here and saying 'nope, sorry - we changed our minds as maintainers and we don't want to support /#/ this way any more, so we shouldn't have implemented this change initially. possibly OK, depending on the way it's communicated and how urgent the use case is, but not a great experience for anyone, I think)

@picnixz
Copy link
Member

picnixz commented Jun 2, 2023

Fair enough. Interestingly,

https://microsoft.github.io/pyright/#type-concepts-advanced

points to the same URL as

https://microsoft.github.io/pyright/#/type-concepts-advanced,

@jayaddison
Copy link
Contributor

Everything after the first fragment identifier (#) is local to the client (and in fact not seen by the server), is my approximate understanding - so yep, I think that makes sense.

@picnixz
Copy link
Member

picnixz commented Jun 2, 2023

my approximate understanding

Don't worry you are correct. I only wanted to point this fact for (possible) future discussion.

@AA-Turner
Copy link
Member

This has conflicts (#11432).

The points raised in discussion are hard to resolve, as single-page applications don't behave 'normally', and having anchors on top of anchors is by necessity framework-dependent behaviour.

Perhaps not detecting any further / characters is a good-enough hueristic, though I'm unsure.

A

@picnixz picnixz added the awaiting:decision PR waiting for a consensus from maintainers. label Feb 25, 2024
@AA-Turner
Copy link
Member

@jayaddison I re-opened this PR during some triage, perhaps you could look again with a verdict? I hesitate to suggest a config option as I don't like adding more of them, but it could be a get-out here and allow us to provide support.

A

1 similar comment
@AA-Turner

This comment was marked as duplicate.

@rsokolewicz
Copy link
Author

Hi! I'm sorry for not replying earlier, I more or less forgot about this.

I agree with the discussion so far. Handling # correctly for all possible urls is difficult, and ideally it should not be the responsibility of sphinx to know about all the possible ways websites/browsers parse the # symbol.

This is not blocking me, so I'm okay if this never gets merged.

I'll reply to other comments later this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting:decision PR waiting for a consensus from maintainers.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

linkcheck incorrectly splits anchor if # is part of URI
4 participants