Commit 315845f

Add documentation about archive issues (#18913)

* Add documentation about archive issues

---------

Co-authored-by: Mike Fiedler <miketheman@gmail.com>
1 parent 7939bcc commit 315845f

2 files changed

+67
-1
lines changed

docs/user/archives.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# Archive Formats
+
+Wheels and source distributions use the ZIP and tar formats to distribute
+multiple files within a single artifact. To avoid archive differential / confusion
+attacks due to complexities of archive formats, PyPI rejects archives which
+use unnecessary and uncommon features of archives, such as archives that are
+designated for multiple disks or archives that are constructed to intentionally
+confuse archive implementations.
+
+This page details some of the archive format features that PyPI rejects so if you
+encounter an error you can debug the issue and upload the fixed archive to PyPI.
+
+## Multiple or malformed central directory
+
+Archives often support having multiple central directories
+or indexes to allow for "append-only" updates. This is not allowed in archives
+on PyPI to avoid confusion while handling multiple central directories.
+Additionally, central directories must be specified correctly such that
+no part of the central directory can be missed or misinterpreted
+(such as the offset within the archive, size, etc.).
+
+## Missing file in central directory
+
+This error occurs when a file is within the archive but the
+file is not recorded in the central directory. This may occur when
+a file is "deleted" by removal from the central directory but its
+contents are not removed from the archive file itself. This can also
+occur if the central directory references a file whose data is not
+within the archive.
+
+## Duplicate file entries
+
+Two or more entries within the archive share the same name,
+either within the "central directory" of file entries
+or across multiple file entries. This is disallowed as some implementations process
+duplicates differently.
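
To debug this error locally, a minimal sketch along the following lines may help. It uses only Python's standard `zipfile` module; the function names `duplicate_names` and `build_example` are hypothetical and are not part of PyPI's validator. It inspects only the names recorded in the central directory, so it is an approximation of the check rather than a reimplementation.

```python
import collections
import io
import warnings
import zipfile


def duplicate_names(data: bytes) -> list[str]:
    """Return names listed more than once in the ZIP central directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # namelist() keeps one item per central directory entry,
        # so repeated names show up more than once.
        counts = collections.Counter(zf.namelist())
    return sorted(name for name, count in counts.items() if count > 1)


def build_example() -> bytes:
    """Build an in-memory ZIP that contains the same name twice."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns about duplicate names but still writes them.
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("pkg/__init__.py", "VERSION = 1\n")
            zf.writestr("pkg/__init__.py", "VERSION = 2\n")
            zf.writestr("pkg/module.py", "")
    return buf.getvalue()
```

Calling `duplicate_names(build_example())` should report `pkg/__init__.py`. Rebuilding the archive from a clean directory is usually the simplest fix.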
+
+## Filename is not valid
+
+The names of files within the archive must all be UTF-8 encoded bytes
+without unprintable characters.
+Unprintable characters are the Unicode codepoints `0x00-0x20` and `0x7F`.
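
The rule above can be sketched as a small predicate. This is an illustration of the rule as stated on this page, not PyPI's implementation, and the function name `is_valid_archive_filename` is hypothetical. Note that the stated range includes `0x20`, so a filename containing a space is rejected by this sketch.

```python
def is_valid_archive_filename(raw: bytes) -> bool:
    """Check a raw archive member name against the rule described above."""
    try:
        name = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 at all.
        return False
    # Reject codepoints 0x00-0x20 (inclusive) and 0x7F.
    return not any(ord(ch) <= 0x20 or ord(ch) == 0x7F for ch in name)
```

For example, `is_valid_archive_filename(b"pkg/module.py")` is true, while `b"pkg/my module.py"` and `b"pkg/\x7f.py"` are rejected.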
+
+## Negative offset
+
+One of the relative offsets specified within the archive
+is negative instead of positive.
+
+## Duplicate extra metadata
+
+There are two or more ZIP extra metadata fields
+with the same ID that have security relevance, such
+as marking a ZIP as ZIP64 or defining the Unicode filename.
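
ZIP extra metadata is stored as a sequence of records, each with a 2-byte ID, a 2-byte size, and then the payload, all little-endian. The sketch below walks such a byte string (for example `ZipInfo.extra` from Python's `zipfile`) and reports repeated IDs. The function name `duplicate_extra_ids` is hypothetical, and which IDs PyPI treats as security relevant is not stated here; the IDs `0x0001` (ZIP64) and `0x7075` (Info-ZIP Unicode path) are used only as examples.

```python
import collections
import struct


def duplicate_extra_ids(extra: bytes) -> list[int]:
    """Return extra-field header IDs that occur more than once."""
    counts = collections.Counter()
    pos = 0
    # Each record: header ID (uint16), data size (uint16), then the data.
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        counts[header_id] += 1
        pos += 4 + size
    return sorted(hid for hid, count in counts.items() if count > 1)


# Example input: two Unicode path records and one ZIP64 record.
unicode_path = struct.pack("<HH", 0x7075, 2) + b"ab"
zip64 = struct.pack("<HH", 0x0001, 8) + bytes(8)
example_extra = unicode_path + zip64 + unicode_path
```

Here `duplicate_extra_ids(example_extra)` reports the repeated `0x7075` ID.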
+
+## Trailing data or comments
+
+Many archives support trailing or prepended data
+or comments within records. PyPI disallows these features
+to avoid smuggling other archive records within comments.
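
A rough local check for this class of problem is sketched below, assuming a plain (non-ZIP64) archive. It relies on the fixed 22-byte layout of the ZIP end-of-central-directory record: if the file has no comment and nothing after it, that record sits at the very end, and the central directory ends exactly where the record begins. The names `find_extra_data` and `make_zip` are hypothetical, and this heuristic is much simpler than what PyPI's validator does.

```python
import io
import struct
import zipfile

# End-of-central-directory record: signature, four uint16 disk/entry
# fields, central directory size and offset (uint32), comment length.
EOCD_FORMAT = "<4sHHHHIIH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes
EOCD_SIGNATURE = b"PK\x05\x06"


def find_extra_data(data: bytes) -> list[str]:
    """Return a list of problems found; an empty list means none found."""
    if len(data) < EOCD_SIZE:
        return ["too short to be a ZIP archive"]
    fields = struct.unpack(EOCD_FORMAT, data[-EOCD_SIZE:])
    signature, cd_size, cd_offset, comment_len = (
        fields[0], fields[5], fields[6], fields[7]
    )
    if signature != EOCD_SIGNATURE:
        # The end record is not the last 22 bytes of the file.
        return ["trailing data or a comment follows the end record"]
    problems = []
    if comment_len != 0:
        problems.append("end record declares a non-empty comment")
    if cd_offset + cd_size != len(data) - EOCD_SIZE:
        problems.append("prepended data or a gap before the end record")
    return problems


def make_zip(comment: bytes = b"") -> bytes:
    """Build a small in-memory ZIP, optionally with an archive comment."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("pkg/__init__.py", "")
        zf.comment = comment
    return buf.getvalue()
```

With these helpers, `find_extra_data(make_zip())` returns an empty list, while an archive with a comment, appended bytes, or prepended bytes yields at least one problem.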
+
+## Further reading
+
+This [white paper](https://www.usenix.org/system/files/usenixsecurity25-you.pdf) details many aspects of ZIPs, differentials, and exploitable archive features.

warehouse/forklift/legacy.py

Lines changed: 4 additions & 1 deletion
@@ -344,7 +344,10 @@ def _is_valid_dist_file(filename, filetype):
         # to avoid parser differentials.
         zip_ok, zip_error = zipfiles.validate_zipfile(filename)
         if not zip_ok:
-            return False, f"ZIP archive not accepted: {zip_error}"
+            return False, (
+                f"ZIP archive not accepted: {zip_error}. "
+                f"See https://docs.pypi.org/archives for more information"
+            )
 
     except zipfile.BadZipFile:  # pragma: no cover
         return False, None
