
add --unstable-feature=regex_link_parsing for a 12% resolve speedup #8488

Closed (2 commits, not merged)

Conversation

@cosmicexplorer (Contributor) commented Jun 23, 2020

TODO

  • complete the regex parsing implementation and ensure it passes all the tests

Problem

While profiling (with py-spy) for #8448 and #8480, I found that calls to pip._internal.models.wheel.Wheel.supported() and pip._internal.index.collector.parse_links() (specifically html5lib.parse()) accounted for almost all of the time spent finding candidates.
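(Editor's sketch.) The PR used py-spy's sampling profiler; for readers who want a dependency-free way to reproduce this kind of hotspot ranking, here is a minimal sketch using the stdlib cProfile. The `parse_links` stand-in below is hypothetical and much simpler than pip's real html5lib-based hotspot:

```python
import cProfile
import io
import pstats
import re

# Hypothetical stand-in for pip._internal.index.collector.parse_links; the
# real hotspot was html5lib.parse(), which this toy version does not reproduce.
def parse_links(html):
    return re.findall(r'href="([^"]+)"', html)

def find_candidates(pages):
    links = []
    for page in pages:
        links.extend(parse_links(page))
    return links

# Synthetic workload: 200 pages, each with 50 anchors.
pages = ['<a href="pkg-%d.whl">pkg</a>' % i * 50 for i in range(200)]

profiler = cProfile.Profile()
profiler.enable()
find_candidates(pages)
profiler.disable()

# Rank functions by cumulative time -- the textual analogue of the flame
# graph that `py-spy record -o profile.svg -- pip download ...` produces.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print("parse_links" in buf.getvalue())  # True: the parser dominates this toy run
```

py-spy has the advantage of attaching to an unmodified running process, which is why it suits profiling a real `pip download` invocation.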

I previously worked on this method in #7729; pex version 1 had made a similar optimization before moving to invoking pip as a subprocess in pex versions >=2.

Solution

  • Add a new option --unstable-feature=regex_link_parsing that selects an experimental parser based on regular expressions instead of html5lib.
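(Editor's sketch.) The PR's parser is incomplete (see the TODO above), so it is not reproduced here. As a rough illustration of the approach, a minimal regex extractor for anchors on a well-formed PEP 503 simple-index page might look like the following; the pattern and names are illustrative, not the PR's code:

```python
import re
from html import unescape

# Hypothetical minimal sketch: extract (href, link text) pairs from a
# well-formed PEP 503 "simple" index page. html5lib additionally tolerates
# broken markup, which this regex deliberately does not attempt to handle.
ANCHOR_RE = re.compile(
    r'<a\s+[^>]*href="(?P<href>[^"]+)"[^>]*>(?P<text>[^<]*)</a>',
    re.IGNORECASE,
)

def parse_links(html):
    """Yield (href, text) for every anchor in a well-formed page."""
    for m in ANCHOR_RE.finditer(html):
        yield unescape(m.group("href")), unescape(m.group("text")).strip()

page = '<html><body><a href="/simple/pip/pip-20.1-py2.py3-none-any.whl#sha256=abc">pip-20.1</a></body></html>'
print(list(parse_links(page)))
# [('/simple/pip/pip-20.1-py2.py3-none-any.whl#sha256=abc', 'pip-20.1')]
```

The speedup comes from skipping html5lib's full tokenizer and tree builder; the trade-off, discussed below, is that a regex like this silently misbehaves on malformed markup.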

Result

A 12% decrease in the mean time to download tensorflow==1.14.0 and all of its dependencies, averaged over 6 runs:

```
> for i in $(seq 6); do rm -rf tmp/ ~/Library/Caches/pip && time pip download --unstable-feature=resolver --unstable-feature=regex_link_parsing -d tmp/ tensorflow==1.14.0 &>/dev/null; done
13.11s user 3.61s system 76% cpu 21.752 total
12.97s user 3.55s system 85% cpu 19.358 total
12.92s user 3.41s system 85% cpu 19.136 total
12.84s user 3.38s system 85% cpu 19.076 total
12.72s user 3.36s system 86% cpu 18.623 total
12.90s user 3.39s system 85% cpu 18.989 total
> for i in $(seq 6); do rm -rf tmp/ ~/Library/Caches/pip && time pip download --unstable-feature=resolver -d tmp/ tensorflow==1.14.0 &>/dev/null; done
15.32s user 4.00s system 82% cpu 23.333 total
14.87s user 3.48s system 88% cpu 20.844 total
14.83s user 3.60s system 86% cpu 21.310 total
15.21s user 3.62s system 88% cpu 21.372 total
15.41s user 3.78s system 82% cpu 23.381 total
14.83s user 3.40s system 86% cpu 21.020 total
```

@pradyunsg (Member)

I really don't think we should be adding regex-based link parsing to pip -- HTML is based on a context-free grammar, and regexes can only model/match regular languages (which are a strict subset of the context-free languages).

There's a lot of "don't use regex for processing [X]HTML" content on the internet (for good reason). Here's one that comes to mind easily: https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
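(Editor's illustration of the objection above.) A single regular expression cannot track nesting depth, so it mis-pairs tags as soon as elements nest. A toy example, not pip code:

```python
import re

# A lone regex cannot count: given nested <div> elements, a non-greedy
# pattern pairs the first opening tag with the *first* closing tag,
# truncating the outer element.
nested = "<div>outer <div>inner</div> tail</div>"
naive = re.search(r"<div>(.*?)</div>", nested)
print(naive.group(1))  # 'outer <div>inner' -- the outer element is cut short
```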

@cosmicexplorer (Contributor, Author) commented Jun 23, 2020

First off, should I have created an issue for this instead of a draft PR first?

Second: Yes! Everything you said is totally correct! One thing that I find is often overlooked is that regular expressions combined with conditional logic and/or state can form a parser for languages at the level of CFGs and beyond. This is how Emacs's js2-mode is able to parse JavaScript to provide semantics-aware highlighting.

The above claim about linguistic power with conditional logic is what Wikipedia would call "original research", but I believe it is true. In the background, I am also working on publishing a novel parsing algorithm that is highly parallelizable and fully general.

The one concern I have about this being feasible long-term is that HTML parsers like html5lib handle broken HTML, which would be hard to replicate, but:

  1. This feature can be kept as an unstable option for a while, with the explicit requirement that all HTML fed to pip must be well-formed (this is easier to ensure if you're resolving against your own repos instead of PyPI).
  2. I think this regex parser will also be able to handle broken HTML (which is not describable by a CFG), because adding conditional logic to regex matching gives us greater linguistic power than a pure CFG parser has.
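(Editor's sketch of the claim in point 2.) A regex used purely as a tokenizer, with an explicit stack carrying state between matches, can recognize balanced tag nesting, which no single regular expression can. The toy grammar below ignores attributes, comments, and void elements, so it is far simpler than real HTML:

```python
import re

# Tokenizer: either a tag <name> / </name>, or a run of plain text.
TOKEN_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|[^<]+")

def is_balanced(text):
    """Check tag nesting with regex tokens plus an explicit stack (the state
    that lifts this past regular-language power)."""
    stack = []
    for m in TOKEN_RE.finditer(text):
        closing, name = m.group(1), m.group(2)
        if name is None:          # plain text between tags
            continue
        if not closing:           # opening tag: push
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False          # closing tag with no matching open
    return not stack              # leftover opens mean unbalanced input

print(is_balanced("<div><p>hi</p></div>"))  # True
print(is_balanced("<div><p>hi</div></p>"))  # False
```

Recovering from genuinely broken HTML the way html5lib does would need far more machinery than this, which is the maintainability concern raised below.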

If this was able to pass the same test suite as the html5lib parser, and there were many more tests added for things like broken HTML, could this feature ever be considered for merging (even under an unstable flag)?

@pfmoore (Member) commented Jun 23, 2020

One concern I'd have is maintainability. If we were to adopt this code, it would in future be supported, and quite likely modified, by people who are not well-versed in the intricacies and risks of correctly parsing (possibly broken) HTML code. So there's a very real chance of bitrot. And describing this algorithm as "original research" doesn't do much to alleviate that concern 🙂

One of the perhaps less obvious advantages of pip relying on external libraries (even if we have to vendor them) is that we delegate expertise in those subject areas to the libraries' maintainers. How viable would it be to split this code out into a reusable library? Maybe a dedicated PEP 503 parser? Then pip could relatively safely vendor that package without taking on the maintenance of that code.

A 12% speed increase is not something we should dismiss lightly, but speed is very definitely only one consideration out of many here.

@pradyunsg (Member) commented Jun 23, 2020

FWIW, there is a dedicated PEP 503 parser library -- https://github.com/brettcannon/mousebender. Adopting it is blocked on dropping Python 2 support.

brettcannon/mousebender#20

@cosmicexplorer (Contributor, Author)

@pfmoore thank you for your thoughtful and detailed response! :D

So there's a very real chance of bitrot. And describing this algorithm as "original research" doesn't do much to alleviate that concern 🙂

Yep! I don't think this idea is really baked enough to work out right now. Thank you so much for talking it through with me!! ^_^

@pradyunsg: that dedicated parser library sounds like it would be a great place to focus my efforts in line with @pfmoore's comment. I do think a dedicated parser library is exactly where I would expect this kind of performance-sensitive code to be.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2021