
Break HTMLPage apart, extract some of it into pip._internal.repositories #5822

Closed (wanted to merge 14 commits)

@uranusjr (Member) commented Sep 27, 2018

What I really want to do for #5800, part 1 of many.

The end goal is to produce a repositories package that can be used on its own, maybe as part of packaging, or as a standalone package. This involves taking parts of PackageFinder and HTMLPage and shuffling them around.

I start with HTMLPage since it is completely internal to index.py. It is used strictly as a temporary object to generate links: an instance is created by making a request (or two) and parsing the content, and the object is then iterated through almost immediately to generate Link objects. So after some refactoring I was able to turn the class into a plain generator function and extract the HTML-parsing parts into repositories. What went into repositories and what didn't might seem arbitrary, but I promise there is logic behind it, and it'll show later (I hope).
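As a rough illustration of the "flatten a temporary class into a generator" idea described above, here is a minimal sketch. This is not pip's actual code; parse_anchors() and make_link() are hypothetical stand-ins for the parsing and Link-construction steps.

```python
# Stand-in for the HTML-parsing step; yields href strings.
def parse_anchors(content):
    return [part for part in content.split() if part.startswith("http")]

# Stand-in for pip's Link construction.
def make_link(href, page_url):
    return (href, page_url)

# Before: HTMLPage is only ever instantiated and then iterated once.
class HTMLPage:
    def __init__(self, content, url):
        self.content = content
        self.url = url

    def links(self):
        for href in parse_anchors(self.content):
            yield make_link(href, self.url)

# After: the same behavior as a flat generator function,
# with no intermediate object to hold state.
def iter_links(content, url):
    for href in parse_anchors(content):
        yield make_link(href, url)

content = "see http://a.example and http://b.example"
page = HTMLPage(content, "http://index.example")
assert list(page.links()) == list(iter_links(content, "http://index.example"))
```

The key observation is that once nothing else touches the instance between construction and iteration, the class adds no value over a plain function.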

There is a lot going on in this PR, but I feel I have documented my steps quite clearly in the commit messages. It should be quite straightforward to make sense of each commit on its own (I move only one or two things around in each commit). The commits are also all tied together, so I can't submit them in parallel. Please do tell me if you feel I should split the series at a certain point.

Parts of index.py will be moved into this subpackage, piece by piece.
The goal is eventually to turn this into a subpackage of packaging.
@uranusjr force-pushed the htmlpage-extract branch 2 times, most recently from 68d14c3 to b2f15f8, September 27, 2018 17:22
This property is only used in HTMLPage.links, which is only called once
per instance in PackageFinder.find_all_candidates(). This would not
affect performance or behavior, but improves data locality.

The unit test of this property is moved and modified to test the
underlying function instead.
This attribute is only used by HTMLPage.links, which is only used once
per instance in PackageFinder.find_all_candidates(), so this change does
not affect performance or behavior, but improves data locality.
This is not used anywhere.
Three methods moved:

* clean_link (Also _link_clean_re, which is only used by it)
* _handle_fail
* _get_content_type

This does affect performance (minimally), but is better for locality. Having these as standalone functions will help with refactoring HTMLPage.
This argument (comes_from) is only used for reporting, and it is enough
to pass in only the URL since that's what is actually used.
@uranusjr force-pushed the htmlpage-extract branch 2 times, most recently from bf6fc8b to d0b5679, September 27, 2018 17:32
It is only used once inside find_all_candidates(). This is a step
toward restructuring the whole finder.

PackageFinder._get_page() is not touched since it is a mock point in
unit tests.
At this point HTMLPage's lifetime is limited to that function.
At this point HTMLPage doesn't really do anything anymore. All it does is get instantiated and immediately iterated through. This can easily be reduced to a flat generator function, iter_links().
@uranusjr force-pushed the htmlpage-extract branch 2 times, most recently from bf55441 to 67795c3, September 27, 2018 18:14
@uranusjr (Member Author):
I don’t understand why mypy has trouble picking up six.moves.html_parser :(

@cjerdonek (Member):
What is the scope of the planned package exactly, and why is it called "repositories"? This is a different meaning from a source control repository, I take it?

Re: splitting it up, if it were me, each PR would be the smallest stand-alone unit. The main reason is that I think it's more courteous to reviewers. I've also found that even simpler patches of mine have taken some time to be reviewed and merged, so I want to lower the bar for review as much as possible.

if url.lower().startswith(scheme) and url[len(scheme)] in '+:':
logger.debug('Cannot look at %s URL %s', scheme, link)
return None
def _iter_links(link, session):
Review comment (Member):
It looks like this is a 100+ line function. I would break it up some. For example, I think it would be good to at least separate the part making the request from the part that parses the content and loops over it. It looks like this separation perhaps existed even in the prior code?
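One way the suggested split could look, as a minimal sketch: one function performs the request, a separate one parses the fetched content. All names here (_get_html_content, _parse_links, the simplified parsing) are hypothetical, not pip's actual code.

```python
# Request half: fetch the page and return its body plus the final URL
# (the final URL matters because of redirects).
def _get_html_content(url, session):
    response = session.get(url)
    return response.text, response.url

# Parse half: a pure function over the fetched content, easy to
# unit-test without any network access.
def _parse_links(content, base_url):
    for token in content.split():
        if token.startswith("http"):
            yield (token, base_url)

# The top-level function becomes just glue between the two halves.
def _iter_links(url, session):
    content, final_url = _get_html_content(url, session)
    yield from _parse_links(content, final_url)

# Usage with a stub session, showing the request half can be faked:
class StubSession:
    def get(self, url):
        class Response:
            text = "http://pkg.example/pkg-1.0.tar.gz"
            url = "http://index.example/simple/pkg/"
        return Response()

links = list(_iter_links("http://index.example/simple/pkg/", StubSession()))
```

Separating I/O from parsing like this is also what makes the parsing logic a candidate for extraction into a standalone package.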

return URL_CLEAN_RE.sub(lambda match: '%%%2x' % ord(match.group(0)), url)


def _iter_anchor_data(document, base_url):
Review comment (Member):

For functions like this where you're looping, I think it would be good to make the inside of the loop its own function (something like parse_anchor_element() in this case). That makes it easier to test in isolation the functionality that's being called multiple times.
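A sketch of that suggestion, with hypothetical names: the loop body of _iter_anchor_data moves into its own parse_anchor_element() so the per-element logic can be tested in isolation.

```python
from urllib.parse import urljoin

# Turn one anchor's attributes into an absolute URL, or None if the
# anchor has no href. Testable on its own, one element at a time.
def parse_anchor_element(anchor_attrs, base_url):
    href = anchor_attrs.get("href")
    if not href:
        return None
    return urljoin(base_url, href)

# The loop itself stays trivial; each element is handled by the
# separately testable helper above.
def _iter_anchor_data(anchors, base_url):
    for attrs in anchors:
        url = parse_anchor_element(attrs, base_url)
        if url is not None:
            yield url

anchors = [{"href": "pkg-1.0.tar.gz"}, {"title": "no href"}]
urls = list(_iter_anchor_data(anchors, "http://index.example/simple/pkg/"))
# urls == ["http://index.example/simple/pkg/pkg-1.0.tar.gz"]
```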

@uranusjr (Member Author) commented Sep 28, 2018

Thanks for the comments. I think I’m going to do this in four parts:

a. Refactor HTMLPage and localise things on it.
b. Refactor steps into smaller functions to prepare for the next step.
c. Flatten HTMLPage into an iterator that calls the smaller functions made in step b.
d. Maybe move things.

I got the repositories name from @dstufft's packaging.repositories work. It is also the name used in PEP 503 to refer to a Simple API service. I chose it since I don't really have a better name for this, and it makes sense to me. I'd be happy to switch to anything else if that makes more sense.

IMO "index" might need to be avoided since it is commonly used to refer only to a PEP 503 service, not find-links entities, but I'd be fine with it if that is actually preferred. Another possible name would be sources (Pipfile uses this), but that makes me think of source code.

I’m closing this to split commits into smaller PRs.

@uranusjr closed this Sep 28, 2018
@cjerdonek (Member):
Okay, thanks!

implementation expects the result of ``html5lib.parse()``.
:param page_url: The URL of the HTML document.
"""
bases = [
Review comment (Member):
It might be more elegant to use a for-else here with a break if the condition is met. That way you stop at the first valid base and don't need to do bases[0], etc.

Reply (Member Author):

Agreed. I copied existing pip code to create this function. It would be a good idea to clean it up a bit.

@cjerdonek added the "C: finder" (PackageFinder and index related code) label Oct 1, 2018
lock bot commented May 31, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The lock bot added the auto-locked label and locked the conversation as resolved, limited to collaborators, May 31, 2019.
@uranusjr deleted the htmlpage-extract branch September 28, 2020 14:46