
add --unstable-feature=regex_link_parsing for a 12% resolve speedup #8488

Closed (2 commits, not merged)

Conversation

@cosmicexplorer (Contributor) commented Jun 23, 2020

TODO

  • complete the regex parsing implementation and ensure it passes all the tests

Problem

While profiling (with py-spy) for #8448 and #8480, I found that calls to pip._internal.models.wheel.Wheel.supported() and pip._internal.index.collector.parse_links() (specifically html5lib.parse()) accounted for almost all of the time spent finding candidates.
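(Editor's sketch.) The PR used py-spy's sampling profiler; for readers who want a dependency-free way to reproduce this kind of hotspot ranking, here is a minimal sketch using the stdlib cProfile. The `parse_links` stand-in below is hypothetical and much simpler than pip's real html5lib-based hotspot:

```python
import cProfile
import io
import pstats
import re

# Hypothetical stand-in for pip._internal.index.collector.parse_links; the
# real hotspot was html5lib.parse(), which this toy version does not reproduce.
def parse_links(html):
    return re.findall(r'href="([^"]+)"', html)

def find_candidates(pages):
    links = []
    for page in pages:
        links.extend(parse_links(page))
    return links

# Synthetic workload: 200 pages, each with 50 anchors.
pages = ['<a href="pkg-%d.whl">pkg</a>' % i * 50 for i in range(200)]

profiler = cProfile.Profile()
profiler.enable()
find_candidates(pages)
profiler.disable()

# Rank functions by cumulative time -- the textual analogue of the flame
# graph that `py-spy record -o profile.svg -- pip download ...` produces.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print("parse_links" in buf.getvalue())  # True: the parser dominates this toy run
```

py-spy has the advantage of attaching to an unmodified running process, which is why it suits profiling a real `pip download` invocation.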

I previously worked on this method in #7729; pex version 1 had made a similar optimization before moving to invoking pip as a subprocess in pex versions >=2.

Solution

  • Add a new option --unstable-feature=regex_link_parsing that selects an experimental parser based on regular expressions instead of html5lib.
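(Editor's sketch.) The PR's parser is incomplete (see the TODO above), so it is not reproduced here. As a rough illustration of the approach, a minimal regex extractor for anchors on a well-formed PEP 503 simple-index page might look like the following; the pattern and names are illustrative, not the PR's code:

```python
import re
from html import unescape

# Hypothetical minimal sketch: extract (href, link text) pairs from a
# well-formed PEP 503 "simple" index page. html5lib additionally tolerates
# broken markup, which this regex deliberately does not attempt to handle.
ANCHOR_RE = re.compile(
    r'<a\s+[^>]*href="(?P<href>[^"]+)"[^>]*>(?P<text>[^<]*)</a>',
    re.IGNORECASE,
)

def parse_links(html):
    """Yield (href, text) for every anchor in a well-formed page."""
    for m in ANCHOR_RE.finditer(html):
        yield unescape(m.group("href")), unescape(m.group("text")).strip()

page = '<html><body><a href="/simple/pip/pip-20.1-py2.py3-none-any.whl#sha256=abc">pip-20.1</a></body></html>'
print(list(parse_links(page)))
# [('/simple/pip/pip-20.1-py2.py3-none-any.whl#sha256=abc', 'pip-20.1')]
```

The speedup comes from skipping html5lib's full tokenizer and tree builder; the trade-off, discussed below, is that a regex like this silently misbehaves on malformed markup.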

Result

A 12% decrease in the mean time to download tensorflow==1.14.0 and all of its dependencies, averaged over 6 runs:

```
> for i in $(seq 6); do rm -rf tmp/ ~/Library/Caches/pip && time pip download --unstable-feature=resolver --unstable-feature=regex_link_parsing -d tmp/ tensorflow==1.14.0 &>/dev/null; done
13.11s user 3.61s system 76% cpu 21.752 total
12.97s user 3.55s system 85% cpu 19.358 total
12.92s user 3.41s system 85% cpu 19.136 total
12.84s user 3.38s system 85% cpu 19.076 total
12.72s user 3.36s system 86% cpu 18.623 total
12.90s user 3.39s system 85% cpu 18.989 total
> for i in $(seq 6); do rm -rf tmp/ ~/Library/Caches/pip && time pip download --unstable-feature=resolver -d tmp/ tensorflow==1.14.0 &>/dev/null; done
15.32s user 4.00s system 82% cpu 23.333 total
14.87s user 3.48s system 88% cpu 20.844 total
14.83s user 3.60s system 86% cpu 21.310 total
15.21s user 3.62s system 88% cpu 21.372 total
15.41s user 3.78s system 82% cpu 23.381 total
14.83s user 3.40s system 86% cpu 21.020 total
```

@pradyunsg (Member)

I really don't think we should be adding regex-based link parsing to pip -- HTML is based on a context-free grammar, and regexes can only model/match regular languages (which are a strict subset of the context-free languages).

There's a lot of "don't use regex for processing [X]HTML" content on the internet (for good reason). Here's one that comes to mind easily: https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
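(Editor's illustration of the objection above.) A single regular expression cannot track nesting depth, so it mis-pairs tags as soon as elements nest. A toy example, not pip code:

```python
import re

# A lone regex cannot count: given nested <div> elements, a non-greedy
# pattern pairs the first opening tag with the *first* closing tag,
# truncating the outer element.
nested = "<div>outer <div>inner</div> tail</div>"
naive = re.search(r"<div>(.*?)</div>", nested)
print(naive.group(1))  # 'outer <div>inner' -- the outer element is cut short
```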

@cosmicexplorer (Contributor, Author) commented Jun 23, 2020

First off, should I have created an issue for this instead of a draft PR first?

Second: Yes! Everything you said is totally correct! One thing that I find is often overlooked is that regular expressions combined with conditional logic and/or state can form a parser for languages at the level of CFGs and beyond. This is how Emacs's js2-mode is able to parse JavaScript to provide semantics-aware highlighting.

The above claim about linguistic power with conditional logic is what Wikipedia would call "original research", but I believe it is true. In the background, I am also working on publishing a novel parsing algorithm that is highly parallelizable and fully general.

The one concern I have about this being feasible long-term is that HTML parsers like html5lib handle broken HTML, which would be hard to replicate, but:

  1. This feature can be kept as an unstable option for a while, with the explicit requirement that all HTML fed to pip must be well-formed (this is easier to ensure if you're resolving against your own repos instead of PyPI).
  2. I think this regex parser will also be able to handle broken HTML (which is not describable by a CFG), because adding conditional logic to regex matching gives us greater linguistic power than a pure CFG parser has.
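(Editor's sketch of the claim in point 2.) A regex used purely as a tokenizer, with an explicit stack carrying state between matches, can recognize balanced tag nesting, which no single regular expression can. The toy grammar below ignores attributes, comments, and void elements, so it is far simpler than real HTML:

```python
import re

# Tokenizer: either a tag <name> / </name>, or a run of plain text.
TOKEN_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|[^<]+")

def is_balanced(text):
    """Check tag nesting with regex tokens plus an explicit stack (the state
    that lifts this past regular-language power)."""
    stack = []
    for m in TOKEN_RE.finditer(text):
        closing, name = m.group(1), m.group(2)
        if name is None:          # plain text between tags
            continue
        if not closing:           # opening tag: push
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False          # closing tag with no matching open
    return not stack              # leftover opens mean unbalanced input

print(is_balanced("<div><p>hi</p></div>"))  # True
print(is_balanced("<div><p>hi</div></p>"))  # False
```

Recovering from genuinely broken HTML the way html5lib does would need far more machinery than this, which is the maintainability concern raised below.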

If this was able to pass the same test suite as the html5lib parser, and there were many more tests added for things like broken HTML, could this feature ever be considered for merging (even under an unstable flag)?

@pfmoore (Member) commented Jun 23, 2020

One concern I'd have is maintainability. If we were to adopt this code, it would in future be supported, and quite likely modified, by people who are not well-versed in the intricacies and risks of correctly parsing (possibly broken) HTML code. So there's a very real chance of bitrot. And describing this algorithm as "original research" doesn't do much to alleviate that concern 🙂

One of the perhaps less obvious advantages of pip relying on external libraries (even if we have to vendor them) is that we delegate expertise in those subject areas to the libraries' maintainers. How viable would it be to split this code out into a reusable library? Maybe a dedicated PEP 503 parser? Then pip could relatively safely vendor that package without taking on the maintenance of that code.

A 12% speed increase is not something we should dismiss lightly, but speed is very definitely only one consideration out of many here.

@pradyunsg (Member) commented Jun 23, 2020

FWIW, there is a dedicated PEP 503 parser library -- https://github.com/brettcannon/mousebender. Adopting it is blocked on dropping Python 2 support.

brettcannon/mousebender#20

@cosmicexplorer (Contributor, Author)

@pfmoore thank you for your thoughtful and detailed response! :D

So there's a very real chance of bitrot. And describing this algorithm as "original research" doesn't do much to alleviate that concern 🙂

Yep! I don't think this idea is really baked enough to work out right now. Thank you so much for talking it through with me!! ^_^

@pradyunsg: that dedicated parser library sounds like it would be a great place to focus my efforts in line with @pfmoore's comment. I do think a dedicated parser library is exactly where I would expect this kind of performance-sensitive code to be.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2021