
Break HTMLPage apart, extract some of it into pip._internal.repositories #5822

Closed (wanted to merge 14 commits)

@uranusjr (Member) commented Sep 27, 2018

What I really want to do for #5800, part 1 of many.

The end goal is to produce a repositories package that can be used on its own, maybe as part of packaging, or as a standalone package. This involves taking parts of PackageFinder and HTMLPage and shuffling them around.

I start with HTMLPage since it is completely internal to index.py. It is used strictly as a temporary object to generate links: an instance is created by making a request (or two) and parsing the content, and the object is then iterated through almost immediately to generate Link objects. So after some refactoring I was able to turn the class into a plain generator function and extract the HTML-parsing parts into repositories. What went into repositories and what didn't might seem arbitrary, but I promise there is logic behind it, and it'll show later (I hope).
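As a rough illustration of the "flatten a temporary class into a generator" idea described above, here is a minimal sketch. This is not pip's actual code; parse_anchors() and make_link() are hypothetical stand-ins for the parsing and Link-construction steps.

```python
# Stand-in for the HTML-parsing step; yields href strings.
def parse_anchors(content):
    return [part for part in content.split() if part.startswith("http")]

# Stand-in for pip's Link construction.
def make_link(href, page_url):
    return (href, page_url)

# Before: HTMLPage is only ever instantiated and then iterated once.
class HTMLPage:
    def __init__(self, content, url):
        self.content = content
        self.url = url

    def links(self):
        for href in parse_anchors(self.content):
            yield make_link(href, self.url)

# After: the same behavior as a flat generator function,
# with no intermediate object to hold state.
def iter_links(content, url):
    for href in parse_anchors(content):
        yield make_link(href, url)

content = "see http://a.example and http://b.example"
page = HTMLPage(content, "http://index.example")
assert list(page.links()) == list(iter_links(content, "http://index.example"))
```

The key observation is that once nothing else touches the instance between construction and iteration, the class adds no value over a plain function.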

There is a lot going on in this PR, but I feel I have documented my steps quite clearly in the commit messages. It should be quite straightforward to make sense of each commit on its own (I move only one or two things around in each commit). The commits are also all tied together, so I can't submit them in parallel. Please do tell me if you feel I should split the series at a certain point.

Parts of index.py will be moved into this subpackage, piece by piece.
The goal is eventually to turn this into a subpackage of packaging.
@uranusjr force-pushed the htmlpage-extract branch 2 times, most recently from 68d14c3 to b2f15f8, September 27, 2018 17:22
This property is only used in HTMLPage.links, which is only called once
per instance in PackageFinder.find_all_candidates(). This would not
affect performance or behavior, but improves data locality.

The unit test of this property is moved and modified to test the
underlying function instead.
This attribute is only used by HTMLPage.links, which is only used once
per instance in PackageFinder.find_all_candidates(), so this change does
not affect performance or behavior, but improves data locality.
This is not used anywhere.
Three methods moved:

* clean_link (Also _link_clean_re, which is only used by it)
* _handle_fail
* _get_content_type

This does affect performance (minimally), but is better for locality. Having these as standalone functions will help with refactoring HTMLPage.
This argument (comes_from) is only used for reporting, and it is enough
to pass in only the URL since that's what is actually used.
@uranusjr force-pushed the htmlpage-extract branch 2 times, most recently from bf6fc8b to d0b5679, September 27, 2018 17:32
It is only used once inside find_all_candidates(). This is a step
toward restructuring the whole finder.

PackageFinder._get_page() is not touched since it is a mock point in
unit tests.
At this point HTMLPage's lifetime is limited to that function.
At this point HTMLPage doesn't really do anything anymore. All it does is get instantiated and immediately iterated through. This can easily be reduced to a flat generator function, iter_links().
@uranusjr force-pushed the htmlpage-extract branch 2 times, most recently from bf55441 to 67795c3, September 27, 2018 18:14
@uranusjr (Member Author):
I don’t understand why mypy has trouble picking up six.moves.html_parser :(

@cjerdonek (Member):
What is the scope of the planned package exactly, and why is it called "repositories"? This is a different meaning from a source control repository, I take it?

Re: splitting it up, if it were me, each PR would be the smallest stand-alone unit. The main reason is that I think it's more courteous to reviewers. I've also found that even simpler patches of mine have taken some time to be reviewed and merged, so I want to lower the bar for review as much as possible.

if url.lower().startswith(scheme) and url[len(scheme)] in '+:':
logger.debug('Cannot look at %s URL %s', scheme, link)
return None
def _iter_links(link, session):
Review comment (Member):
It looks like this is a 100+ line function. I would break it up some. For example, I think it would be good to at least separate the part making the request from the part that parses the content and loops over it. It looks like this separation perhaps existed even in the prior code?
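One way the suggested split could look, as a minimal sketch: one function performs the request, a separate one parses the fetched content. All names here (_get_html_content, _parse_links, the simplified parsing) are hypothetical, not pip's actual code.

```python
# Request half: fetch the page and return its body plus the final URL
# (the final URL matters because of redirects).
def _get_html_content(url, session):
    response = session.get(url)
    return response.text, response.url

# Parse half: a pure function over the fetched content, easy to
# unit-test without any network access.
def _parse_links(content, base_url):
    for token in content.split():
        if token.startswith("http"):
            yield (token, base_url)

# The top-level function becomes just glue between the two halves.
def _iter_links(url, session):
    content, final_url = _get_html_content(url, session)
    yield from _parse_links(content, final_url)

# Usage with a stub session, showing the request half can be faked:
class StubSession:
    def get(self, url):
        class Response:
            text = "http://pkg.example/pkg-1.0.tar.gz"
            url = "http://index.example/simple/pkg/"
        return Response()

links = list(_iter_links("http://index.example/simple/pkg/", StubSession()))
```

Separating I/O from parsing like this is also what makes the parsing logic a candidate for extraction into a standalone package.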

return URL_CLEAN_RE.sub(lambda match: '%%%2x' % ord(match.group(0)), url)


def _iter_anchor_data(document, base_url):
Review comment (Member):

For functions like this where you're looping, I think it would be good to make the inside of the loop its own function (something like parse_anchor_element() in this case). That makes it easier to test in isolation the functionality that's being called multiple times.
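A sketch of that suggestion, with hypothetical names: the loop body of _iter_anchor_data moves into its own parse_anchor_element() so the per-element logic can be tested in isolation.

```python
from urllib.parse import urljoin

# Turn one anchor's attributes into an absolute URL, or None if the
# anchor has no href. Testable on its own, one element at a time.
def parse_anchor_element(anchor_attrs, base_url):
    href = anchor_attrs.get("href")
    if not href:
        return None
    return urljoin(base_url, href)

# The loop itself stays trivial; each element is handled by the
# separately testable helper above.
def _iter_anchor_data(anchors, base_url):
    for attrs in anchors:
        url = parse_anchor_element(attrs, base_url)
        if url is not None:
            yield url

anchors = [{"href": "pkg-1.0.tar.gz"}, {"title": "no href"}]
urls = list(_iter_anchor_data(anchors, "http://index.example/simple/pkg/"))
# urls == ["http://index.example/simple/pkg/pkg-1.0.tar.gz"]
```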

@uranusjr (Member Author) commented Sep 28, 2018

Thanks for the comments. I think I’m going to do this in four parts:

a. Refactor HTMLPage and localise things on it.
b. Refactor steps into smaller functions to prepare for the next step.
c. Flatten HTMLPage into an iterator that calls the smaller functions made in step b.
d. Maybe move things.

I got the repositories name from @dstufft's packaging.repositories work. It is also the name used in PEP 503 to refer to a Simple API service. I chose it since I don't really have a better name for this, and it makes sense to me. I'd be happy to switch to anything else if that makes more sense.

IMO "index" might need to be avoided since it is commonly used to refer only to a PEP 503 service, not find-links entities, but I'd be fine with it if that is actually preferred. Another possible name would be sources (Pipfile uses this), but that makes me think of source code.

I’m closing this to split commits into smaller PRs.

@uranusjr closed this Sep 28, 2018
@cjerdonek (Member):
Okay, thanks!

implementation expects the result of ``html5lib.parse()``.
:param page_url: The URL of the HTML document.
"""
bases = [
Review comment (Member):
It might be more elegant to use a for-else here with a break if the condition is met. That way you stop at the first valid base and don't need to do bases[0], etc.

Reply (Member Author):

Agreed. I copied existing pip code to create this function. It would be a good idea to clean it up a bit.

@cjerdonek added the "C: finder" (PackageFinder and index related code) label Oct 1, 2018
lock bot commented May 31, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The lock bot added the auto-locked label and locked the conversation as resolved, limited to collaborators, May 31, 2019.
@uranusjr deleted the htmlpage-extract branch September 28, 2020 14:46