Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always consider metadata as utf-8. #22

Merged
merged 2 commits into from
Feb 20, 2024

Conversation

thatch
Copy link
Contributor

@thatch thatch commented Feb 17, 2024

This is documented near the beginning of
https://packaging.python.org/en/latest/specifications/core-metadata/ and warehouse currently serves these as content-type: application/octet-stream with no encoding, so requests spends a surprising amount of time on what are generally lower-ascii files trying to guess an encoding.

I think this is 100% safe -- files that can't be understood as utf-8 appear to not work in pip, which reads them as bytes, saves them as bytes and importlib always uses utf-8. A better API for pypi-simple might be to return bytes, but that's more disruptive.

Sample debug log pydantic-2.6.1-py3-none-any.whl.metadata showing more than 300ms spent trying to load the thing, only to give up and treat it as utf-8 anyway.

2024-02-16 16:24:06,001 DEBUG    chardet.charsetprober:65 SHIFT_JIS Japanese prober hit error at byte 74979
2024-02-16 16:24:06,047 DEBUG    chardet.charsetprober:66 EUC-JP Japanese prober hit error at byte 50943
2024-02-16 16:24:06,075 DEBUG    chardet.charsetprober:64 GB2312 Chinese prober hit error at byte 52901
2024-02-16 16:24:06,103 DEBUG    chardet.charsetprober:64 EUC-KR Korean prober hit error at byte 50943
2024-02-16 16:24:06,130 DEBUG    chardet.charsetprober:64 CP949 Korean prober hit error at byte 50943
2024-02-16 16:24:06,158 DEBUG    chardet.charsetprober:64 Big5 Chinese prober hit error at byte 50944
2024-02-16 16:24:06,185 DEBUG    chardet.charsetprober:64 EUC-TW Taiwan prober hit error at byte 50943
2024-02-16 16:24:06,214 DEBUG    chardet.charsetprober:64 Johab Korean prober hit error at byte 52901
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 windows-1251 Russian confidence = 0.01
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 KOI8-R Russian confidence = 0.01
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 ISO-8859-5 Russian confidence = 0.0
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 MacCyrillic Russian confidence = 0.0
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 IBM866 Russian confidence = 0.02409316779801155
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 IBM855 Russian confidence = 0.0
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 ISO-8859-7 Greek confidence = 0.0
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 windows-1253 Greek confidence = 0.0
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 ISO-8859-5 Bulgarian confidence = 0.0
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 windows-1251 Bulgarian confidence = 0.01
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 TIS-620 Thai confidence = 0.01
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 ISO-8859-9 Turkish confidence = 0.5782542205214045
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 windows-1255 Hebrew confidence = 0.0
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 windows-1255 Hebrew confidence = 0.01
2024-02-16 16:24:06,314 DEBUG    chardet.charsetprober:98 windows-1255 Hebrew confidence = 0.01

You might not notice this in a normal application, but I'm fetching these in a loop, with cachecontrol as the session, with a hot cache.

Here are traces for this change (the green block is get_package_metadata and the total time is visible top-right):

Before:

image

After:

image

This is documented near the beginning of
https://packaging.python.org/en/latest/specifications/core-metadata/ and
warehouse currently serves these as `content-type: application/octet-stream`
with no encoding, so requests spends a surprising amount of time on what are
generally lower-ascii files trying to guess an encoding.
Copy link

codecov bot commented Feb 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (a1347d5) 94.80% compared to head (4fdb8f6) 95.80%.
Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #22      +/-   ##
==========================================
+ Coverage   94.80%   95.80%   +1.00%     
==========================================
  Files          10       10              
  Lines         616      620       +4     
  Branches      112      112              
==========================================
+ Hits          584      594      +10     
+ Misses         30       22       -8     
- Partials        2        4       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -383,7 +383,7 @@ def get_package_metadata(
r.raise_for_status()
digester.update(r.content)
digester.finalize()
return r.text
return r.content.decode("utf-8")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return r.content.decode("utf-8")
return r.content.decode("utf-8", "surrogateescape")

Because we don't want to crash on the inevitable case where someone somehow created a wheel with non-UTF-8 metadata.

@jwodder jwodder added the performance Efficient use of time and space label Feb 17, 2024
@thatch
Copy link
Contributor Author

thatch commented Feb 20, 2024

I moved the bulk of the code to a bytes api, because that's really what I was after. Added surrogateescape as requested, and there are tests.

@jwodder jwodder merged commit 7cdf45e into jwodder:master Feb 20, 2024
14 checks passed
@jwodder
Copy link
Owner

jwodder commented Feb 24, 2024

This is now released in pypi-simple 1.5.0. Thank you for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Efficient use of time and space
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants