Ensure CDX status is a string #739

Merged
merged 1 commit into from
Aug 9, 2022
@edsu edsu commented Jul 14, 2022

Description

If a CDX entry has a status that is an integer instead of a string (which is what ArchiveWeb.page currently generates) then pywb will blow up during replay:

Traceback (most recent call last):
  File "/Users/edsummers/Projects/pywb/pywb/warcserver/basewarcserver.py", line 77, in __call__
    result = endpoint(environ, **args)
  File "/Users/edsummers/Projects/pywb/pywb/warcserver/basewarcserver.py", line 46, in post_fullrequest
    return handler(params)
  File "/Users/edsummers/Projects/pywb/pywb/warcserver/handlers.py", line 160, in __call__
    out_headers, resp = loader(cdx, params)
  File "/Users/edsummers/Projects/pywb/pywb/warcserver/resource/responseloader.py", line 37, in __call__
    entry = self.load_resource(cdx, params)
  File "/Users/edsummers/Projects/pywb/pywb/warcserver/resource/responseloader.py", line 214, in load_resource
    if not status or not status.startswith(('2', '4', '5')):
AttributeError: 'int' object has no attribute 'startswith'

This commit guards against that by casting int status values to str.
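A minimal sketch of the idea, not the actual patch: the helper name `normalize_status` and the sample CDXJ JSON are made up for illustration, and only the `startswith` check is taken from the traceback above.

```python
import json


def normalize_status(cdx):
    """Return a copy of the CDXJ fields with an int status cast to str.

    Hypothetical helper for illustration; not part of pywb's API.
    """
    fields = dict(cdx)
    status = fields.get('status')
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(status, int) and not isinstance(status, bool):
        fields['status'] = str(status)
    return fields


# Sample JSON block of a CDXJ line with an int status (assumed shape)
raw = json.loads('{"url": "http://example.com/", "status": 200}')

entry = normalize_status(raw)

# The check from the traceback now works, because status is a str
status = entry['status']
rejected = not status or not status.startswith(('2', '4', '5'))
```

Entries that already carry a string status pass through unchanged, so existing indexes are unaffected.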

Motivation and Context

It would be useful to be able to unpack a WACZ file created with ArchiveWeb.page (or other tools) and make it playable in pywb by moving the WARC and CDXJ files into place. Currently this is not possible because WACZ uses an int for the CDXJ status, whereas pywb expects it to be a string in multiple places. In the spirit of the robustness principle, this change is lenient in what it accepts in CDXJ files and casts an int status value to a str.

Types of changes

  • Replay fix (fixes a replay specific issue)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added or updated tests to cover my changes.
  • All new and existing tests passed.

If a CDXJ entry has a status that is an int, that can cause problems in
multiple places in pywb. This change ensures that int status values are
converted to str.