Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vendor charset-normalizer instead of chardet #12638

Merged
merged 2 commits into from
May 6, 2024

Conversation

bluetech
Copy link
Contributor

Description

The requests library which pip vendors supports either chardet or charset-normalizer. Currently pip vendors chardet, however charset-normalizer has better import time than chardet, which improves pip's startup time. So this PR switches to it.

As for the difference between the libraries, i.e. which is better, I didn't really try to ascertain that, because I'm mostly assuming that the charset-detection feature of requests is not really utilized by pip so it doesn't really matter. However given requests supports charset-normalizer I assume it's fit for purpose.

The author of charset-normalizer agreed to pip vendoring: jawah/charset_normalizer#456

Refs #4768 (comment)

Note on charset-normalizer wheel

pip requires pure-python wheels, but charset-normalizer offers both any and platform-specific wheel with some optimizations. The current latest version of vendoring, 1.2.0, installs the platform-specific wheel. This was changed in pradyunsg/vendoring@a8706ae, but this commit is not released yet.

In order not to rely on an unreleased version of vendoring, I let it install the platform-specific wheel but remove the .so files in the vendoring config in pyproject.toml. If a new version of vendoring is released these lines can be removed.

See: pradyunsg/vendoring#62

How I created the PR

  1. Applied the existing requests.patch to a local checkout of requests 2.31.0 (the version currently vendored by pip)
  2. Made the changes to switch to charset-normalizer
  3. Regenerated requests.patch
  4. Ran vendoring sync . -v using vendoring version 1.2.0.

Therefore to verify the huge diff in this PR, you can:

  • Review the changes to requests.patch and pyproject.toml
  • Checkout this PR, run vendoring sync . -v and see there is no diff

@pfmoore pfmoore added this to the 24.1 milestone May 3, 2024
@pradyunsg pradyunsg force-pushed the charset-normalizer branch from 097707f to 2532ac6 Compare May 6, 2024 16:40
@pradyunsg pradyunsg merged commit 47d0dc5 into pypa:main May 6, 2024
28 checks passed
@nateprewitt
Copy link
Member

A little late to the conversation on this, we may consider removing chardet/charset-normalizer entirely from the pip vendor. At first glance, it shouldn't materially affect the function of Requests in pip and bypasses the import entirely which should be objectively better than swapping out dependencies.

@bluetech bluetech deleted the charset-normalizer branch May 14, 2024 15:05
@bluetech
Copy link
Contributor Author

@nateprewitt I think that would be great. Generally how would you suggest I go about verifying that charset-normalizer is definitely not useful for pip?

@nateprewitt
Copy link
Member

nateprewitt commented May 14, 2024

Both charset-normalizer and chardet are only used for the text API in Requests. We've had a long running discussion on making this behavior optional, and while we can't remove it from the default Requests installation, we can support the dependencies not being present.

I'd like to get some feedback from other pip maintainers on their thoughts before doing any work here. This seems like a relatively low-effort improvement since we're already changing around behavior though.

@uranusjr
Copy link
Member

Hmm, yeah we don’t really need charset detection anywhere. If reqeusts supports the library not being present, that’s an easy win (assuming requests is willing to support it). I’d say it’s worthwhile even if pip needs to do some patching; we already need to patch requests anyway.

@nateprewitt
Copy link
Member

I put up a quick PR for what this might look like in Requests (psf/requests#6702). I'll wait to see if Ian comes back with any concerns. If he's bought in, we can look at porting it here. I don't see this as being particularly difficult for us to support though.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants