Break HTMLPage apart, extract some of it into pip._internal.repositories #5822
Parts of index.py will be moved into this subpackage, piece by piece. The goal is eventually to turn this into a subpackage of packaging.
This property is only used in HTMLPage.links, which is only called once per instance in PackageFinder.find_all_candidates(). This would not affect performance or behavior, but improves data locality. The unit test of this property is moved and modified to test the underlying function instead.
This attribute is only used by HTMLPage.links, which is only used once per instance in PackageFinder.find_all_candidates(), so this change does not affect performance or behavior, but improves data locality.
This is not used anywhere.
Three methods moved:
* clean_link (also _link_clean_re, which is only used by it)
* _handle_fail
* _get_content_type
This does affect performance (minimally), but is better for locality. Having the functions local would help with refactoring HTMLPage.
This argument (comes_from) is only used for reporting, and it is enough to pass in only the URL since that's what is actually used.
It is only used once inside find_all_candidates(). This is a step toward restructuring the whole finder. PackageFinder._get_page() is not touched since it is a mock point in unit tests.
At this point HTMLPage's lifetime is limited to that function.
At this point HTMLPage doesn't really do anything anymore. It is only instantiated and then immediately iterated through. This can easily be reduced to a flat generator function, iter_links().
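To illustrate the kind of reduction described in this commit message, here is a minimal sketch. It is not the pip code: LinkPage and the module-level iter_links() below are hypothetical stand-ins that work on plain strings, and they only show why a class that is built and immediately iterated can be replaced by a generator function.

```python
from typing import Iterable, Iterator


class LinkPage:
    """Hypothetical stand-in for a class that is only built, then iterated."""

    def __init__(self, hrefs: Iterable[str]) -> None:
        self._hrefs = list(hrefs)

    def iter_links(self) -> Iterator[str]:
        for href in self._hrefs:
            if href:  # skip empty href values
                yield href


def iter_links(hrefs: Iterable[str]) -> Iterator[str]:
    """Flat generator with the same behavior and no intermediate object."""
    for href in hrefs:
        if href:  # skip empty href values
            yield href
```

Both forms yield the same items, so a caller that previously wrote `LinkPage(hrefs).iter_links()` could call `iter_links(hrefs)` instead.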
I don’t understand why mypy has trouble picking up |
What is the scope of the planned package exactly, and why is it called "repositories"? I take it this is a different meaning from a source control repository? Re: splitting it up, if it were me, each PR would be the smallest stand-alone unit. The main reason is that I think it's more courteous to reviewers. In my experience, even my easier patches have taken some time to be reviewed and merged, so I want to lower the bar for review as much as possible.
    if url.lower().startswith(scheme) and url[len(scheme)] in '+:':
        logger.debug('Cannot look at %s URL %s', scheme, link)
        return None
def _iter_links(link, session):
It looks like this is a 100+ line function. I would break it up some. For example, I think it would be good to at least separate the part making the request from the part that parses the content and loops over it. It looks like this separation perhaps existed even in the prior code?
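A rough sketch of the separation suggested here, under loud assumptions: fetch_page(), parse_links(), and the injected get callable are hypothetical names, and the regex is a toy replacement for real HTML parsing. The point is only that the request step and the parse-and-loop step can be tested independently.

```python
import re
from typing import Callable, Iterator, Optional

# Toy href extractor for illustration only; real code would use an HTML parser.
HREF_RE = re.compile(r'href="([^"]*)"')


def fetch_page(url: str, get: Callable[[str], Optional[str]]) -> Optional[str]:
    """Request step: return the page content, or None if unavailable."""
    return get(url)


def parse_links(content: str) -> Iterator[str]:
    """Parse step: yield every href found in already-fetched content."""
    for match in HREF_RE.finditer(content):
        yield match.group(1)


def iter_links(url: str, get: Callable[[str], Optional[str]]) -> Iterator[str]:
    """Thin wrapper that composes the two steps."""
    content = fetch_page(url, get)
    if content is None:
        return
    for href in parse_links(content):
        yield href
```

With this split, parse_links() can be exercised with a literal string, and the request step can be replaced by any callable in tests (for example, a dict's get method).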
    return URL_CLEAN_RE.sub(lambda match: '%%%2x' % ord(match.group(0)), url)


def _iter_anchor_data(document, base_url):
For functions like this where you're looping, I think it would be good to make the inside of the loop its own function (something like parse_anchor_element() in this case). That makes it easier to test in isolation the functionality that's being called multiple times.
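A minimal sketch of what that could look like, assuming the document is an ElementTree-style tree. It uses the standard library's xml.etree.ElementTree rather than html5lib, and the tuple return shape and the data-requires-python attribute handling are illustrative assumptions, not the code under review.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Optional, Tuple
from urllib.parse import urljoin

AnchorData = Tuple[str, Optional[str]]


def parse_anchor_element(anchor: ET.Element, base_url: str) -> Optional[AnchorData]:
    """Handle a single <a> element; return None if it has no usable href."""
    href = anchor.get("href")
    if not href:
        return None
    return urljoin(base_url, href), anchor.get("data-requires-python")


def iter_anchor_data(document: ET.Element, base_url: str) -> Iterator[AnchorData]:
    """The loop stays trivial; per-element logic lives in the helper."""
    for anchor in document.iter("a"):
        data = parse_anchor_element(anchor, base_url)
        if data is not None:
            yield data
```

The helper can then be unit tested with a single hand-built element, without constructing a whole page.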
Thanks for the comments. I think I’m going to do this in four parts:
a. Refactor HTMLPage and localise things on it.
I got the IMO “index” might need to be avoided since it is commonly used to only refer to a PEP 503 service, not find-links entities, but I’d be fine if that is actually preferred. Another possible name would be
I’m closing this to split commits into smaller PRs.
Okay, thanks!
    implementation expects the result of ``html5lib.parse()``.
    :param page_url: The URL of the HTML document.
    """
    bases = [
It might be more elegant to use a for-else here with a break if the condition is met. That way you stop at the first valid base and don't need to do bases[0], etc.
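As a sketch of the for-else shape being suggested (not the actual pip function): find_base_url() is a hypothetical name, and the example uses the standard library's ElementTree in place of the html5lib tree the real code receives.

```python
import xml.etree.ElementTree as ET


def find_base_url(document: ET.Element, page_url: str) -> str:
    """Return the href of the first <base> that has one, else page_url."""
    for base in document.iter("base"):
        href = base.get("href")
        if href is not None:
            break  # stop at the first valid base
    else:
        # Runs only when the loop finished without a break,
        # including when there were no <base> elements at all.
        return page_url
    return href
```

The loop stops as soon as a usable element is found, so there is no need to build a list of candidates and index into it.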
Agreed. I copied existing pip code to create this function. It would be a good idea to clean it up a bit.
What I really want to do for #5800, part 1 of many.

The end goal is to produce a repositories package that can be used on its own, maybe as part of packaging, or as a standalone package. This involves taking parts of PackageFinder and HTMLPage and shuffling them around.

I start with HTMLPage since it is completely internal to index.py. It is used strictly as a temporary object to generate links. An instance is created by making a request (or two) and parsing the content. The object is then iterated through almost immediately to generate link objects. So I did some refactoring, turned the class into a plain generator function, and was able to extract the HTML-parsing parts into repositories. It might seem arbitrary what went into repositories and what didn’t, but I promise there is logic behind it, and it’ll show later (I hope).

There is a lot going on in this PR, but I feel I have documented my steps quite clearly in the commit messages. It should be quite straightforward to make sense of each commit on its own (I move only one or two things around in each commit). These are also all tied together, so I can’t submit them in parallel. Please do tell me if you feel I should make a split at a certain point.