fix #11446: linkcheck should split url only if # part of anchor #11447

rsokolewicz · 2023-05-28T22:36:41Z

as described in the issue, the link checker incorrectly processes URLs that have a # that is part of the directory path, rather than designating the anchor.

This pull request fixes this by splitting the URL using a regexp (.+/[^/]*)#([^/]*?/?$), which is a complicated way of saying we want to split the URL by # only if it is not followed by more forward slashes.

I've tested the implementation by hand on the URL https://microsoft.github.io/pyright/#/type-concepts-advanced as described in the issue and it works. So that nothing breaks in the future, I added a test that ensures that the following two URLs:

"https://www.example.com/path/to/resource#123"

"https://www.example.com/path/to/#/resource#123"

the anchor "123" is split off correctly. Although the test works, it's quite technical with the mockers that I used. If you guys have suggestions on how to improve, I'm all ears.

picnixz

A few comments.

picnixz · 2023-05-30T13:08:59Z

sphinx/builders/linkcheck.py

-                        anchor = None
-                        break
+            uri_pattern = r"(.+/[^/]*)#([^/]*?/?$)"
+            match = re.search(uri_pattern, uri)


For efficiency reason, I would suggest declaring the pattern in the global scope (or the class scope) and compiling it.

picnixz · 2023-05-30T13:10:33Z

sphinx/builders/linkcheck.py

+                req_url, anchor = uri, None
+
+            for rex in self.anchors_ignore:
+                if anchor and rex.match(anchor):


anchor can be checked outside of the loop:

if anchor: for rex in self.anchors_ignore: if rex.match(anchor): anchor = None break

picnixz · 2023-05-30T13:14:23Z

tests/test_build_linkcheck.py

@@ -728,3 +733,44 @@ def test_linkcheck_exclude_documents(app):
            'info': 'br0ken_link matched br[0-9]ken_link from linkcheck_exclude_documents',
        },
    ]
+
+
+urls = [


I would advise not to use a global declaration here and simply write

@pytest.mark.parametrize("test_url", [ "...." ])

Also, maybe rename "123" to "anchor" so that we know that it is an anchor.

jayaddison · 2023-06-01T17:45:05Z

I'm not a regular contributor here, so please take my feedback with a grain of salt. My concern about this change isn't related to the code, but to standards (in particular RFC3986).

I don't think that handling the URL fragments this way would match the RFC3986 specification, and I think (although could be wrong) that the rfc3986 Python library demonstrates that:

>>> import rfc3986
>>> rfc3986.urlparse("https://www.example.com/path/to/#/resource#123")
ParseResult(scheme='https', userinfo=None, host='www.example.com', port=None, path='/path/to/', query=None, fragment='/resource#123')

If we think that's a bug with rfc3986 perhaps we could report it there - my guess is that pyright is using some a nonstandard URL scheme here however.

Supporting that kind of thing is often fine short-term to solve an individual case, but over a longer duration of time it can lead to standards fragmentation and resulting maintenance burden, confusion and compatibility problems.

I'll try to think of ways that we could address this and also handle the use case -- and perhaps that'll involve speaking to the pyright documentation maintainers -- but I'd like to suggest proceeding cautiously with any code changes.

(again, my contributions and involvement here are fairly minimal, so feel free to ignore all of the above)

picnixz · 2023-06-02T08:53:25Z

@jayaddison pyright is not doing anything "wrong" per se. The use of /#/ is for SPA (single-page applications) that allows you to navigate between different pages without restarting the whole app and this only for the client framework router (for instance Angular's HashLocationStrategy vs PathLocationStrategy).

While I commented the PR without considering RFC3986, I agree that it might be better to enforce RFC3986 by default. We could, however, hard-detect whether /#/ is in the URI or not instead of using a regular expression, and then, split the path + query according to RFC.

jayaddison · 2023-06-02T09:26:43Z

Yep, I've seen a few single-page applications use similar techniques, and since they can inspect the current URL from within their JavaScript code, they can implement patterns like this.

We don't have a JavaScript runtime within Sphinx, though, and it wouldn't seem great to follow a path of trying to implement edge case handling for all possible application configurations (that's the path I think that Sphinx should avoid, while considering alternative routes to achieve everyone's desired goals. I have a vague memory that there have been attempts to specify this in the past to make fragment-URLs crawlable by search engines).

picnixz · 2023-06-02T09:29:26Z

What about hard-detecting /#/ then (instead of a regex as done by this PR?)

jayaddison · 2023-06-02T09:41:23Z

Using /#/ as a special case would introduce the same problem: it'd essentially define an ad-hoc approach for how to handle some subset of URLs. I wouldn't be surprised if soon after (or, equally problematically, months or years after), we'd encounter requests about sites that use /#/ with a different anchor scheme - we end up adding further specialisations and customisations, often with conflicting objectives.

Where possible I think it's better to determine the objective, discuss that widely, gather feedback on ideas and problems, perhaps prototype an approach, and codify a standard (and then vote and agree upon it). It may seem like the standardisation process can take a long time, but I think that in the long term it's a more respectful approach and saves time (because less work is wasted, and systems become interoperable).

jayaddison · 2023-06-02T09:42:56Z

we end up adding further specialisations and customisations, often with conflicting objectives.

(or we end up coming back here and saying 'nope, sorry - we changed our minds as maintainers and we don't want to support /#/ this way any more, so we shouldn't have implemented this change initially. possibly OK, depending on the way it's communicated and how urgent the use case is, but not a great experience for anyone, I think)

picnixz · 2023-06-02T09:49:05Z

Fair enough. Interestingly,

https://microsoft.github.io/pyright/#type-concepts-advanced

points to the same URL as

https://microsoft.github.io/pyright/#/type-concepts-advanced,

jayaddison · 2023-06-02T09:51:38Z

Everything after the first fragment identifier (#) is local to the client (and in fact not seen by the server), is my approximate understanding - so yep, I think that makes sense.

picnixz · 2023-06-02T09:53:44Z

my approximate understanding

Don't worry you are correct. I only wanted to point this fact for (possible) future discussion.

AA-Turner · 2023-07-20T20:19:59Z

This has conflicts (#11432).

The points raised in discussion are hard to resolve, as single-page applications don't behave 'normally', and having anchors on top of anchors is by necessity framework-dependent behaviour.

Perhaps not detecting any further / characters is a good-enough hueristic, though I'm unsure.

A

AA-Turner · 2024-10-21T05:27:34Z

@jayaddison I re-opened this PR during some triage, perhaps you could look again with a verdict? I hesitate to suggest a config option as I don't like adding more of them, but it could be a get-out here and allow us to provide support.

A

rsokolewicz · 2024-10-30T08:16:15Z

Hi! I'm sorry for not replying earlier, I more or less forgot about this.

I agree with the discussion so far. Handling # correctly for all possible urls is difficult, and ideally it should not be the responsibility of sphinx to know about all the possible ways websites/browsers parse the # symbol.

This is not blocking me, so I'm okay if this never gets merged.

I'll reply to other comments later this week.

rsokolewicz added 2 commits May 28, 2023 08:39

only split # if part of anchor

7dd2de3

linter

108bf21

picnixz reviewed May 30, 2023

View reviewed changes

picnixz added the awaiting:decision PR waiting for a consensus from maintainers. label Feb 25, 2024

This comment was marked as duplicate.

Sign in to view

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix #11446: linkcheck should split url only if # part of anchor #11447

fix #11446: linkcheck should split url only if # part of anchor #11447

rsokolewicz commented May 28, 2023

picnixz left a comment

picnixz May 30, 2023

picnixz May 30, 2023

picnixz May 30, 2023

jayaddison commented Jun 1, 2023

picnixz commented Jun 2, 2023

jayaddison commented Jun 2, 2023

picnixz commented Jun 2, 2023

jayaddison commented Jun 2, 2023

jayaddison commented Jun 2, 2023

picnixz commented Jun 2, 2023

jayaddison commented Jun 2, 2023

picnixz commented Jun 2, 2023

AA-Turner commented Jul 20, 2023

AA-Turner commented Oct 21, 2024

This comment was marked as duplicate.

rsokolewicz commented Oct 30, 2024

fix #11446: linkcheck should split url only if # part of anchor #11447

Are you sure you want to change the base?

fix #11446: linkcheck should split url only if # part of anchor #11447

Conversation

rsokolewicz commented May 28, 2023

picnixz left a comment

Choose a reason for hiding this comment

picnixz May 30, 2023

Choose a reason for hiding this comment

picnixz May 30, 2023

Choose a reason for hiding this comment

picnixz May 30, 2023

Choose a reason for hiding this comment

jayaddison commented Jun 1, 2023

picnixz commented Jun 2, 2023

jayaddison commented Jun 2, 2023

picnixz commented Jun 2, 2023

jayaddison commented Jun 2, 2023

jayaddison commented Jun 2, 2023

picnixz commented Jun 2, 2023

jayaddison commented Jun 2, 2023

picnixz commented Jun 2, 2023

AA-Turner commented Jul 20, 2023

AA-Turner commented Oct 21, 2024

This comment was marked as duplicate.

rsokolewicz commented Oct 30, 2024