Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle CDXException and respond with HTTP 400 Bad Request #626

Merged
merged 2 commits into from
Apr 27, 2021

Conversation

sebastian-nagel
Copy link
Contributor

Description

WarcServer should handle CDXExceptions and respond with HTTP status "400 Bad Request".

The PR reads the first line of the CDX result early so that the exception is raised and be caught and forwarded as error.

The unit test included in this PR is based on #624 (included in this PR).

Motivation and Context

  • ensures backward-compatibility with the CDXJ server of PyWB 0.33.2
  • some CDX clients (eg. cdx-toolkit) iterate over result pages until an invalid page number is hit and the server responds with "404 Bad Request". If the error is not properly handled and the server responds with status 500, the client tries the same request (with the page number out of range) again and again.

Screenshots

Log output and stacktrace of the error:

Traceback (most recent call last):
  File ".../site-packages/gevent/pywsgi.py", line 970, in handle_one_response
    self.run_application()
  File ".../site-packages/gevent/pywsgi.py", line 918, in run_application
    self.process_result()
  File ".../site-packages/gevent/pywsgi.py", line 902, in process_result
    for data in self.result:
  File ".../site-packages/pywb/warcserver/handlers.py", line 100, in check_str
    for line in lines:
  File ".../site-packages/pywb/warcserver/handlers.py", line 21, in <genexpr>
    return content_type, (cdx.to_cdxj(fields) for cdx in cdx_iter)
  File ".../site-packages/pywb/warcserver/access_checker.py", line 206, in wrap_iter
    for cdx in cdx_iter:
  File ".../site-packages/pywb/warcserver/index/fuzzymatcher.py", line 162, in get_fuzzy_iter
    for cdx in cdx_iter:
  File ".../site-packages/pywb/warcserver/index/cdxops.py", line 132, in <genexpr>
    return (cdx for cdx, _ in zip(cdx_iter, range(limit)))
  File ".../site-packages/pywb/warcserver/index/aggregator.py", line 76, in <genexpr>
    cdx_iter = (add_source(cdx, name) for cdx in cdx_iter)
  File ".../site-packages/pywb/warcserver/index/zipnum.py", line 166, in gen_cdx
    for blk in blocks:
  File ".../site-packages/pywb/warcserver/index/zipnum.py", line 280, in idx_to_cdx
    for idx in idx_iter:
  File ".../site-packages/pywb/warcserver/index/zipnum.py", line 246, in compute_page_range
    raise CDXException(msg.format(curr_page, total_pages - 1))
pywb.warcserver.index.cdxobject.CDXException: Page 1 invalid: First Page is 0, Last Page is 0
2021-03-18T14:44:55Z {'REMOTE_ADDR': '127.0.0.1', 'REMOTE_PORT': '42774', 'HTTP_HOST': 'localhost:45463', (hidden keys: 21)} failed with CDXException

127.0.0.1 - - [2021-03-18 15:44:55] "GET /CC-MAIN-2019-51/index?url=does-not-exist-domain.org&matchType=domain&page=1 HTTP/1.1" 500 161 0.040050

Types of changes

  • Replay fix (fixes a replay specific issue)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added or updated tests to cover my changes.
  • All new and existing tests passed.

- make that CDXException is raised early so that it can be handled
  in the IndexHandler
@ikreymer
Copy link
Member

HI @sebastian-nagel, thanks for opening all these PRs, sorry for delay in reviewing, but should be able to get them merged by next week and do a new release!

@sebastian-nagel
Copy link
Contributor Author

Hi @ikreymer, thanks! And no need to hurry. I'm still testing the upgrade of index.commoncrawl.org to a recent PyWB version (from 0.33.2) and might come up with more questions, issues or PRs in the next days. Thanks in advance!

@ikreymer
Copy link
Member

Hm, it seems the lazy evaluation there makes it harder to get the error. Thanks for the fix!

@ikreymer ikreymer merged commit 212691b into webrecorder:master Apr 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants