Commit ff8f18d
chore(build): remove source files from wheel RECORD (#14279)
## Problem

PyPI has issued a deprecation warning for our `ddtrace` package wheels. The wheel RECORD files list source files (`.c`, `.cpp`, `.h`, `.pyx`, etc.) that were removed during post-processing, but the RECORD wasn't updated to reflect this. This mismatch between RECORD contents and actual wheel contents will become a hard error in future PyPI releases.

## Solution

We've updated our wheel build pipeline to ensure RECORD file integrity:

1. **Enhanced `zip_filter.py`** - Our existing script for removing source files from wheels now also updates the RECORD file to maintain consistency between listed and actual contents.
2. **Unified build approach** - All platforms (Linux, macOS, Windows) now use the same `zip_filter.py` script for consistent source file removal and RECORD updating.
3. **Automated validation** - Added a new validation script and CI step that verifies every built wheel has a RECORD file that accurately reflects its contents, including proper SHA256 hashes and file sizes.

This ensures our wheels comply with PyPI requirements while maintaining the same distribution content: compiled extensions without source files, but with accurate metadata.
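For reference, each RECORD row pairs a file's archive path with its size and a `sha256=` digest encoded as urlsafe base64 with the `=` padding stripped. A minimal sketch of forming one row (the file name and contents here are made up):

```python
import base64
import hashlib


def record_entry(path, data):
    # Build one RECORD row: (path, "sha256=<urlsafe b64, no padding>", size).
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return (path, f"sha256={encoded}", str(len(data)))


entry = record_entry("ddtrace/__init__.py", b"print('hi')\n")
print(entry)
```

A validator then only needs to recompute this triple for every file in the archive and compare it against the stored rows.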
## Checklist

- [x] PR author has checked that all the criteria below are met
  - The PR description includes an overview of the change
  - The PR description articulates the motivation for the change
  - The change includes tests OR the PR description describes a testing strategy
  - The PR description notes risks associated with the change, if any
  - Newly-added code is easy to change
  - The change follows the [library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
  - The change includes or references documentation updates if necessary
  - Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [ ] Reviewer has checked that all the criteria below are met
  - Title is accurate
  - All changes are related to the pull request's stated goal
  - Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes
  - Testing strategy adequately addresses listed risks
  - Newly-added code is easy to change
  - Release note makes sense to a user of the library
  - If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  - Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
1 parent 15eb170 commit ff8f18d

File tree

3 files changed (+196, -9 lines)


.github/workflows/build_python_3.yml

Lines changed: 14 additions & 8 deletions

```diff
@@ -69,20 +69,16 @@ jobs:
       # See: https://stackoverflow.com/a/65402241
       CIBW_ENVIRONMENT_MACOS: CMAKE_BUILD_PARALLEL_LEVEL=24 SYSTEM_VERSION_COMPAT=0 CMAKE_ARGS="-DNATIVE_TESTING=OFF"
       CIBW_REPAIR_WHEEL_COMMAND_LINUX: |
+        python scripts/zip_filter.py {wheel} \*.c \*.cpp \*.cc \*.h \*.hpp \*.pyx \*.md &&
         mkdir ./tempwheelhouse &&
         unzip -l {wheel} | grep '\.so' &&
         auditwheel repair -w ./tempwheelhouse {wheel} &&
-        for w in ./tempwheelhouse/*.whl; do
-          python scripts/zip_filter.py $w \*.c \*.cpp \*.cc \*.h \*.hpp \*.pyx \*.md
-          mv $w {dest_dir}
-        done &&
+        mv ./tempwheelhouse/*.whl {dest_dir} &&
         rm -rf ./tempwheelhouse
       CIBW_REPAIR_WHEEL_COMMAND_MACOS: |
-        zip -d {wheel} \*.c \*.cpp \*.cc \*.h \*.hpp \*.pyx \*.md &&
+        python scripts/zip_filter.py {wheel} \*.c \*.cpp \*.cc \*.h \*.hpp \*.pyx \*.md &&
         MACOSX_DEPLOYMENT_TARGET=12.7 delocate-wheel --require-archs {delocate_archs} -w {dest_dir} -v {wheel}
-      CIBW_REPAIR_WHEEL_COMMAND_WINDOWS: choco install -y 7zip &&
-        7z d -r "{wheel}" *.c *.cpp *.cc *.h *.hpp *.pyx *.md &&
-        move "{wheel}" "{dest_dir}"
+      CIBW_REPAIR_WHEEL_COMMAND_WINDOWS: python scripts/zip_filter.py "{wheel}" "*.c" "*.cpp" "*.cc" "*.h" "*.hpp" "*.pyx" "*.md" && mv "{wheel}" "{dest_dir}"
       CIBW_TEST_COMMAND: "python {project}/tests/smoke_test.py"

     steps:
@@ -107,6 +103,16 @@ jobs:
         with:
           only: ${{ matrix.only }}

+      - name: Validate wheel RECORD files
+        shell: bash
+        run: |
+          for wheel in ./wheelhouse/*.whl; do
+            if [ -f "$wheel" ]; then
+              echo "Validating $(basename $wheel)..."
+              python scripts/validate_wheel.py "$wheel"
+            fi
+          done
+
       - if: runner.os != 'Windows'
         run: |
           echo "ARTIFACT_NAME=${{ matrix.only }}" >> $GITHUB_ENV
```
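Outside of CI, the same per-wheel loop could be driven from Python; a rough sketch (the `wheelhouse` directory and the `scripts/validate_wheel.py` path are assumptions about running from a local checkout):

```python
import pathlib
import subprocess
import sys


def validate_all(wheelhouse="wheelhouse"):
    # Invoke the validator once per wheel, mirroring the CI step's bash loop.
    wheels = sorted(pathlib.Path(wheelhouse).glob("*.whl"))
    for wheel in wheels:
        print(f"Validating {wheel.name}...")
        subprocess.run([sys.executable, "scripts/validate_wheel.py", str(wheel)], check=True)
    return len(wheels)
```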

scripts/validate_wheel.py

Lines changed: 141 additions & 0 deletions

```python
#!/usr/bin/env python3
"""
Validate that a wheel's contents match its RECORD file.

This script checks:
1. All files in the wheel are listed in RECORD
2. All files in RECORD exist in the wheel
3. File hashes match (for files that have hashes in RECORD)
4. File sizes match
"""

import argparse
import base64
import csv
import hashlib
import io
from pathlib import Path
import sys
import zipfile


def compute_hash(data):
    """Compute the urlsafe base64 encoded SHA256 hash of data."""
    hash_digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(hash_digest).rstrip(b"=").decode("ascii")


def validate_wheel(wheel_path):
    """Validate that wheel contents match its RECORD file."""
    errors = []

    with zipfile.ZipFile(wheel_path, "r") as wheel:
        # Find the RECORD file
        record_path = None
        for name in wheel.namelist():
            if name.endswith(".dist-info/RECORD"):
                record_path = name
                break

        if not record_path:
            errors.append("No RECORD file found in wheel")
            return errors

        # Parse the RECORD file
        record_content = wheel.read(record_path).decode("utf-8")
        record_entries = {}

        reader = csv.reader(io.StringIO(record_content))
        for row in reader:
            if not row or len(row) < 3:
                continue

            file_path, hash_str, size_str = row[0], row[1], row[2]
            record_entries[file_path] = {"hash": hash_str, "size": int(size_str) if size_str else None}

        # Get all files in the wheel (excluding directories)
        wheel_files = set()
        for name in wheel.namelist():
            # Skip directories (they end with /)
            if not name.endswith("/"):
                wheel_files.add(name)

        record_files = set(record_entries.keys())

        # Check for files in wheel but not in RECORD
        files_not_in_record = wheel_files - record_files
        if files_not_in_record:
            for f in sorted(files_not_in_record):
                errors.append(f"File in wheel but not in RECORD: {f}")

        # Check for files in RECORD but not in wheel
        files_not_in_wheel = record_files - wheel_files
        if files_not_in_wheel:
            for f in sorted(files_not_in_wheel):
                errors.append(f"File in RECORD but not in wheel: {f}")

        # Validate hashes and sizes for files that exist in both
        for file_path in record_files & wheel_files:
            # Skip the RECORD file itself
            if file_path == record_path:
                continue

            record_entry = record_entries[file_path]
            file_data = wheel.read(file_path)

            # Check size
            if record_entry["size"] is not None:
                actual_size = len(file_data)
                if actual_size != record_entry["size"]:
                    errors.append(
                        f"Size mismatch for {file_path}: RECORD says {record_entry['size']}, actual is {actual_size}"
                    )

            # Check hash
            if record_entry["hash"]:
                # Parse the hash format (algorithm=base64hash)
                if "=" in record_entry["hash"]:
                    algo, expected_hash = record_entry["hash"].split("=", 1)
                    if algo == "sha256":
                        actual_hash = compute_hash(file_data)
                        if actual_hash != expected_hash:
                            errors.append(
                                f"Hash mismatch for {file_path}: RECORD says {expected_hash}, actual is {actual_hash}"
                            )
                    else:
                        errors.append(f"Unknown hash algorithm {algo} for {file_path} (expected sha256)")
                else:
                    errors.append(f"Invalid hash format for {file_path}: {record_entry['hash']}")
            # The RECORD file itself should not have a hash
            elif file_path != record_path:
                errors.append(f"No hash recorded for {file_path}")

    return errors


def main():
    parser = argparse.ArgumentParser(description="Validate wheel RECORD file matches contents")
    parser.add_argument("wheel", help="Path to wheel file to validate")

    args = parser.parse_args()

    wheel_path = Path(args.wheel)
    if not wheel_path.exists():
        print(f"Error: Wheel file not found: {wheel_path}", file=sys.stderr)
        sys.exit(1)

    print(f"Validating {wheel_path.name}...")
    errors = validate_wheel(wheel_path)

    if errors:
        print(f"\n[ERROR] Found {len(errors)} error(s):", file=sys.stderr)
        for error in errors:
            print(f"  - {error}", file=sys.stderr)
        sys.exit(1)

    print("[SUCCESS] Wheel validation passed!")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```
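The checks above can be exercised end to end on a tiny in-memory archive; this sketch (the package name and contents are invented) builds a wheel-like zip with a self-consistent RECORD and then re-verifies hash and size the same way:

```python
import base64
import csv
import hashlib
import io
import zipfile


def urlsafe_sha256(data):
    # Same encoding the validator expects: urlsafe base64 SHA256, '=' stripped.
    return base64.urlsafe_b64encode(hashlib.sha256(data).digest()).rstrip(b"=").decode("ascii")


payload = b"__version__ = '0.0.1'\n"
record = io.StringIO()
writer = csv.writer(record, lineterminator="\n")
writer.writerow(["pkg/__init__.py", f"sha256={urlsafe_sha256(payload)}", str(len(payload))])
writer.writerow(["pkg-0.0.1.dist-info/RECORD", "", ""])  # RECORD lists itself without a hash

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg/__init__.py", payload)
    zf.writestr("pkg-0.0.1.dist-info/RECORD", record.getvalue())

# Re-read and confirm hash and size agree with RECORD, as the CI step would.
with zipfile.ZipFile(buf) as zf:
    rows = csv.reader(io.StringIO(zf.read("pkg-0.0.1.dist-info/RECORD").decode("utf-8")))
    entries = {row[0]: (row[1], row[2]) for row in rows if row}
    data = zf.read("pkg/__init__.py")
    ok = entries["pkg/__init__.py"] == (f"sha256={urlsafe_sha256(data)}", str(len(data)))
print("RECORD consistent:", ok)
```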

scripts/zip_filter.py

Lines changed: 41 additions & 1 deletion

```diff
@@ -1,19 +1,59 @@
 import argparse
+import csv
 import fnmatch
+import io
 import os
 import zipfile


+def update_record(record_content, patterns):
+    """Update the RECORD file to remove entries for deleted files."""
+    # Parse the existing RECORD
+    records = []
+    reader = csv.reader(io.StringIO(record_content))
+
+    for row in reader:
+        if not row:
+            continue
+        file_path = row[0]
+        # Skip files that match removal patterns
+        if not any(fnmatch.fnmatch(file_path, pattern) for pattern in patterns):
+            records.append(row)
+
+    # Rebuild the RECORD content
+    output = io.StringIO()
+    writer = csv.writer(output, lineterminator="\n")
+    for record in records:
+        writer.writerow(record)
+
+    return output.getvalue()
+
+
 def remove_from_zip(zip_filename, patterns):
     temp_zip_filename = f"{zip_filename}.tmp"
+    record_content = None
+
+    # First pass: read RECORD file if it exists
+    with zipfile.ZipFile(zip_filename, "r") as source_zip:
+        for file in source_zip.infolist():
+            if file.filename.endswith(".dist-info/RECORD"):
+                record_content = source_zip.read(file.filename).decode("utf-8")
+                break
+
+    # Second pass: create new zip without removed files and with updated RECORD
     with zipfile.ZipFile(zip_filename, "r") as source_zip, zipfile.ZipFile(
         temp_zip_filename, "w", zipfile.ZIP_DEFLATED
     ) as temp_zip:
         # DEV: Use ZipInfo objects to ensure original file attributes are preserved
         for file in source_zip.infolist():
             if any(fnmatch.fnmatch(file.filename, pattern) for pattern in patterns):
                 continue
-            temp_zip.writestr(file, source_zip.read(file.filename))
+            elif file.filename.endswith(".dist-info/RECORD") and record_content:
+                # Update the RECORD file
+                updated_record = update_record(record_content, patterns)
+                temp_zip.writestr(file, updated_record)
+            else:
+                temp_zip.writestr(file, source_zip.read(file.filename))
     os.replace(temp_zip_filename, zip_filename)
```
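In miniature, the RECORD-pruning step that `zip_filter.py` now performs behaves like this (the rows below are invented, and this helper is a simplified sketch of the one in the diff):

```python
import csv
import fnmatch
import io


def prune_record(record_content, patterns):
    # Drop RECORD rows whose path matches any removal pattern, keep the rest.
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row in csv.reader(io.StringIO(record_content)):
        if row and not any(fnmatch.fnmatch(row[0], p) for p in patterns):
            writer.writerow(row)
    return output.getvalue()


record = "pkg/mod.c,sha256=abc,10\npkg/mod.so,sha256=def,20\n"
print(prune_record(record, ["*.c", "*.pyx"]))  # only the .so row survives
```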
