Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(livecheck/pypi): update to use json endpoint to query version #18895

Merged
merged 3 commits into from
Dec 8, 2024

Conversation

chenrui333
Copy link
Member

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes? Here's an example.
  • Have you successfully run brew style with your changes locally?
  • Have you successfully run brew typecheck with your changes locally?
  • Have you successfully run brew tests with your changes locally?

Right now, pypi strategy is crawling the page for checking the latest version, but we should just use json to find out the latest version.

$ curl -s https://pypi.org/pypi/mkvtomp4/json | jq -r '.info.version'
2.0

@chenrui333 chenrui333 force-pushed the update-to-use-json-endpoint branch from 459350b to 67b555f Compare December 7, 2024 06:35
@chenrui333 chenrui333 force-pushed the update-to-use-json-endpoint branch from 67b555f to c65498d Compare December 7, 2024 06:42
@chenrui333 chenrui333 force-pushed the update-to-use-json-endpoint branch from c65498d to d49e01b Compare December 7, 2024 07:00
@cho-m
Copy link
Member

cho-m commented Dec 7, 2024

May be able to reuse Json.find_versions (replacing PageMatch.find_versions) by

  • updating url to be https://pypi.org/pypi/#{T.must(match[:package_name]).gsub(/%20|_/, "-")}/json
  • using a default block that runs json.dig("info", "version") (may need to check if any nil handling is needed)

Copy link
Member

@samford samford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To provide context for this PR: some upstream PyPI project pages now involve a Fastly verification step before the page is page is loaded (e.g., https://pypi.org/pypi/fred-py-api for me). In this case, the HTML response for the project page URL (that the Pypi strategy checks) only contains a noscript "JavaScript is disabled in your browser" message along with some JavaScript. This appears to occur with some projects and not others, so the existing strategy logic still works in some cases (e.g., cryptography for me). From limited testing, the JSON API doesn't involve the same JavaScript verification (as expected).

Up to now, we've scraped the project HTML because it provides the version information we need but requires considerably less data transfer than the JSON API. Using cryptography as an example, the HTML page is 25 KB compressed (369 KB uncompressed) and the JSON API response is 376 KB compressed (2 MB uncompressed). We have 437 formulae using Pypi, so that difference adds up (e.g., in autobump, --tap homebrew/core runs, etc.). I agree that scraping the HTML page isn't ideal (and that upstream would probably prefer we use the API) but it was a practical choice when the JSON API involves over 10x more data transfer and grows with each release much more rapidly than the HTML page (i.e., the JSON contains lengthy metadata on all the files for every release ever made).

It would be ideal if upstream provided a /pypi/<project>/latest/json API endpoint that would redirect to the JSON API URL for the latest non-yanked version. With that setup, we could match the version from the Location header without even having to parse the JSON response, so that would be simple, fast, and lightweight. Barring that, removing the releases and urls arrays from the main JSON API response would dramatically reduce the size (that mostly duplicates the Index API functionality anyway).

That said, the JSON API appears to be our only option [that differentiates yanked releases] at this point, so we'll just have to live with the higher data transfer and talk to upstream about options that would work for both us and them.

I did a comparative livecheck run across all the formulae using Pypi and only ~28% are working for me locally (and many without a livecheck block fall back to Git, etc.). I didn't see any failures when testing with the JSON API (using my own implementation of this idea) and the time to check them was comparable. Looking at the uncompressed content size of the JSON responses, thankfully only twelve are larger than 1 MB (and another twelve are larger than 500 KB). It's still a lot more data transfer overall but it could be worse.


In terms of the changes here:

  • This should use a similar approach to the Crates strategy, which uses Strategy#page_content and Json#versions_from_content with a default block.
  • #generate_input_values should only return a :url value (e.g., values[:url] = "https://pypi.org/pypi/#{T.must(match[:package_name]).gsub(/%20|_/, "-")}/json") and the related tests need to be updated.

I'll try to add some review suggestions here where possible (it can be tricky, since GitHub doesn't allow for suggestions across removed lines).

Copy link
Member

@samford samford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't add a suggestion here for the default block, but please add this between NICE_NAME and FILENAME_REGEX:

        # The default `strategy` block used to extract version information when
        # a `strategy` block isn't provided.
        DEFAULT_BLOCK = T.let(proc do |json|
          json.dig("info", "version")
        end.freeze, T.proc.params(
          arg0: T::Hash[String, T.untyped],
        ).returns(T.nilable(String)))

Library/Homebrew/livecheck/strategy/pypi.rb Outdated Show resolved Hide resolved
Library/Homebrew/livecheck/strategy/pypi.rb Outdated Show resolved Hide resolved
Library/Homebrew/livecheck/strategy/pypi.rb Outdated Show resolved Hide resolved
Library/Homebrew/livecheck/strategy/pypi.rb Outdated Show resolved Hide resolved
Library/Homebrew/livecheck/strategy/pypi.rb Outdated Show resolved Hide resolved
Library/Homebrew/livecheck/strategy/pypi.rb Show resolved Hide resolved
Library/Homebrew/livecheck/strategy/pypi.rb Outdated Show resolved Hide resolved
Library/Homebrew/livecheck/strategy/pypi.rb Outdated Show resolved Hide resolved
Library/Homebrew/test/livecheck/strategy/pypi_spec.rb Outdated Show resolved Hide resolved
Library/Homebrew/test/livecheck/strategy/pypi_spec.rb Outdated Show resolved Hide resolved
@samford
Copy link
Member

samford commented Dec 7, 2024

@chenrui333 For what it's worth, I can just push a commit to incorporate these changes, if you're okay with them. I don't want you to have to go through the effort of manually incorporating all these suggestions when I already have a commit ready 😆

@chenrui333
Copy link
Member Author

@chenrui333 For what it's worth, I can just push a commit to incorporate these changes, if you're okay with them. I don't want you to have to go through the effort of manually incorporating all these suggestions when I already have a commit ready 😆

I am totally okay with it :)

@samford samford marked this pull request as ready for review December 7, 2024 17:59
@samford samford force-pushed the update-to-use-json-endpoint branch 2 times, most recently from 36fa958 to 7a4de1e Compare December 7, 2024 18:03
Copy link
Member

@samford samford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a commit that reworks this to use Json::find_versions in Pypi::find_versions, borrowing some of the approach from the Crate strategy. Besides that, this pares down the fields in the ::generate_input_values return hash to only :url, as we're not using a generated regex to match version information in this setup.

I've added a provided_content parameter to ::find_versions as part of this process and I will expand the Pypi tests to increase coverage (like the Crates tests and others where we fetch content in ::find_versions) in a later PR. 75% of Pypi checks are failing at the moment (with some returning inaccurate version information), so the current priority is getting this fix merged in the short-term.

I'll be busy for the next few hours and I'll come back to this in the evening but I think this is most of what I wanted to address before merging.

@Agtxcjr

This comment has been minimized.

@samford samford force-pushed the update-to-use-json-endpoint branch 2 times, most recently from 5e3696b to e71b86e Compare December 8, 2024 01:01
This reworks the new `Pypi` JSON API implementation to use
`Json::find_versions` in `Pypi::find_versions`, borrowing some of the
approach from the `Crate` strategy.

Besides that, this pares down the fields in the
`::generate_input_values` return hash to only `:url`, as we're not
using a generated regex to match version information in this setup.

This adds a `provided_content` parameter to `::find_versions` as part
of this process and I will expand the `Pypi` tests to increase
coverage (like the `Crates` tests) in a later PR. 75% of `Pypi` checks
are failing at the moment (with some returning inaccurate version
information), so the current priority is getting this fix merged in
the short-term.
@samford samford force-pushed the update-to-use-json-endpoint branch from e71b86e to 1c161ab Compare December 8, 2024 01:42
Among other things, the previous commit added a `provided_content`
paramter to `Pypi::find_versions`, so this takes advantage of that to
expand `Pypi` test coverage to 100%.
@samford samford force-pushed the update-to-use-json-endpoint branch from 1c161ab to ac4854e Compare December 8, 2024 01:45
Copy link
Member

@samford samford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another pass through this to tweak a few things and added a commit that increases Pypi test coverage to 100% (for lines and branches). One thing to note is that we're able to use Json::find_versions here instead of Json::versions_from_content, which allows us to minimize the required logic in Pypi::find_versions (much like we do for strategies that use PageMatch::find_versions, like Apache). [I forget if there was a specific reason why I didn't do the same thing in the Crate strategy but I'll create a follow-up PR for that if it's possible.]

I think this is in good shape now, so we just need someone to review these changes to double-check our work. I've tested this again across the existing Pypi formulae and it continues to work as expected.

Copy link
Member

@nandahkrishna nandahkrishna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@samford
Copy link
Member

samford commented Dec 8, 2024

Thanks, @nandahkrishna! I don't anticipate this causing issues overnight based on the testing I've done, so I'm going to go ahead and merge this to fix the related checks.

Thanks @chenrui333 for getting this started and @cho-m for digging into this so deeply and resolving my questions ❤️

@samford samford merged commit b13a4c5 into Homebrew:master Dec 8, 2024
28 checks passed
@chenrui333 chenrui333 deleted the update-to-use-json-endpoint branch December 8, 2024 04:29
@chenrui333
Copy link
Member Author

Thanks @samford @cho-m @nandahkrishna! ❤️

@samford samford mentioned this pull request Dec 8, 2024
7 tasks
@MikeMcQuaid
Copy link
Member

Thanks @chenrui333 @samford @cho-m @nandahkrishna!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants