-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix #11446: linkcheck should split url only if # part of anchor #11447
base: master
Are you sure you want to change the base?
fix #11446: linkcheck should split url only if # part of anchor #11447
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments.
anchor = None | ||
break | ||
uri_pattern = r"(.+/[^/]*)#([^/]*?/?$)" | ||
match = re.search(uri_pattern, uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For efficiency reason, I would suggest declaring the pattern in the global scope (or the class scope) and compiling it.
req_url, anchor = uri, None | ||
|
||
for rex in self.anchors_ignore: | ||
if anchor and rex.match(anchor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anchor
can be checked outside of the loop:
if anchor:
for rex in self.anchors_ignore:
if rex.match(anchor):
anchor = None
break
@@ -728,3 +733,44 @@ def test_linkcheck_exclude_documents(app): | |||
'info': 'br0ken_link matched br[0-9]ken_link from linkcheck_exclude_documents', | |||
}, | |||
] | |||
|
|||
|
|||
urls = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would advise not to use a global declaration here and simply write
@pytest.mark.parametrize("test_url", [
"...."
])
Also, maybe rename "123" to "anchor" so that we know that it is an anchor.
I'm not a regular contributor here, so please take my feedback with a grain of salt. My concern about this change isn't related to the code, but to standards (in particular RFC3986). I don't think that handling the URL fragments this way would match the RFC3986 specification, and I think (although could be wrong) that the >>> import rfc3986
>>> rfc3986.urlparse("https://www.example.com/path/to/#/resource#123")
ParseResult(scheme='https', userinfo=None, host='www.example.com', port=None, path='/path/to/', query=None, fragment='/resource#123') If we think that's a bug with Supporting that kind of thing is often fine short-term to solve an individual case, but over a longer duration of time it can lead to standards fragmentation and resulting maintenance burden, confusion and compatibility problems. I'll try to think of ways that we could address this and also handle the use case -- and perhaps that'll involve speaking to the (again, my contributions and involvement here are fairly minimal, so feel free to ignore all of the above) |
@jayaddison pyright is not doing anything "wrong" per se. The use of While I commented the PR without considering RFC3986, I agree that it might be better to enforce RFC3986 by default. We could, however, hard-detect whether |
Yep, I've seen a few single-page applications use similar techniques, and since they can inspect the current URL from within their JavaScript code, they can implement patterns like this. We don't have a JavaScript runtime within Sphinx, though, and it wouldn't seem great to follow a path of trying to implement edge case handling for all possible application configurations (that's the path I think that Sphinx should avoid, while considering alternative routes to achieve everyone's desired goals. I have a vague memory that there have been attempts to specify this in the past to make fragment-URLs crawlable by search engines). |
What about hard-detecting |
Using Where possible I think it's better to determine the objective, discuss that widely, gather feedback on ideas and problems, perhaps prototype an approach, and codify a standard (and then vote and agree upon it). It may seem like the standardisation process can take a long time, but I think that in the long term it's a more respectful approach and saves time (because less work is wasted, and systems become interoperable). |
(or we end up coming back here and saying 'nope, sorry - we changed our minds as maintainers and we don't want to support |
Fair enough. Interestingly,
points to the same URL as
|
Everything after the first fragment identifier ( |
Don't worry you are correct. I only wanted to point this fact for (possible) future discussion. |
This has conflicts (#11432). The points raised in discussion are hard to resolve, as single-page applications don't behave 'normally', and having anchors on top of anchors is by necessity framework-dependent behaviour. Perhaps not detecting any further A |
@jayaddison I re-opened this PR during some triage, perhaps you could look again with a verdict? I hesitate to suggest a config option as I don't like adding more of them, but it could be a get-out here and allow us to provide support. A |
1 similar comment
This comment was marked as duplicate.
This comment was marked as duplicate.
Hi! I'm sorry for not replying earlier, I more or less forgot about this. I agree with the discussion so far. Handling # correctly for all possible urls is difficult, and ideally it should not be the responsibility of sphinx to know about all the possible ways websites/browsers parse the # symbol. This is not blocking me, so I'm okay if this never gets merged. I'll reply to other comments later this week. |
fixes #11446
as described in the issue, the link checker incorrectly processes URLs that have a
#
that is part of the directory path, rather than designating the anchor.This pull request fixes this by splitting the URL using a regexp
(.+/[^/]*)#([^/]*?/?$)
, which is a complicated way of saying we want to split the URL by#
only if it is not followed by more forward slashes.I've tested the implementation by hand on the URL
https://microsoft.github.io/pyright/#/type-concepts-advanced
as described in the issue and it works. So that nothing breaks in the future, I added a test that ensures that the following two URLs:the anchor "123" is split off correctly. Although the test works, it's quite technical with the mockers that I used. If you guys have suggestions on how to improve, I'm all ears.