Deduplication & "Not Modified" WARC Records #224
Comments
This is a Heritrix bug. From the WARC spec, chapter 5.9 on WARC-Payload-Digest:
(emphasis is mine) This clearly means that in server-not-modified revisit records, this field should either be omitted or be equal to the original record.
I do wonder if OpenWayback can gracefully handle the absence of the digest? Presumably it should if original URL and date are provided?
If they are, then yes, I think that should work. Worth building in a test case anyway. In the case of the above, I'm wondering whether we should attempt to handle this despite the fact that it's non-compliant? Alternatively, we could give an example of a way to work around it outside OpenWayback (I'm thinking of a script to create 'dummy' CDX lines for revisits with no matching response).
We could probably detect empty payload digests (the digest of zero-length content should always have the same value) and process the record as if there wasn't any digest. In the absence of original URI and/or date, that would mean using the latest "previous" capture that isn't a revisit.
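A rough sketch of that idea, not OpenWayback code: the `Capture` class, `effective_digest` and `find_original` are hypothetical names made up for illustration, and it assumes digests are written in the `sha1:<base32>` form and that timestamps are sortable 14-digit strings.

```python
import base64
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence

# Digest of zero-length content; it is the same for every empty payload.
# Assumes the "sha1:<base32>" label form for digests.
EMPTY_SHA1 = "sha1:" + base64.b32encode(hashlib.sha1(b"").digest()).decode("ascii")


@dataclass
class Capture:
    """Hypothetical stand-in for one index entry (not an OpenWayback class)."""
    url: str
    timestamp: str            # assumed sortable, e.g. "20150101120000"
    digest: Optional[str]
    is_revisit: bool = False


def effective_digest(digest: Optional[str]) -> Optional[str]:
    """Treat a missing digest and the empty-content digest the same way."""
    if digest is None or digest.strip().lower() == EMPTY_SHA1.lower():
        return None
    return digest


def find_original(revisit: Capture, captures: Sequence[Capture]) -> Optional[Capture]:
    """Pick the latest earlier non-revisit capture of the same URL.

    If the revisit carries a usable digest, only captures with that digest
    qualify; otherwise any earlier non-revisit capture is accepted.
    """
    candidates = [
        c for c in captures
        if c.url == revisit.url
        and not c.is_revisit
        and c.timestamp < revisit.timestamp
    ]
    wanted = effective_digest(revisit.digest)
    if wanted is not None:
        candidates = [c for c in candidates if c.digest == wanted]
    return max(candidates, key=lambda c: c.timestamp, default=None)
```

With this, a revisit whose digest equals `EMPTY_SHA1` resolves to the most recent earlier response for the same URL instead of failing to match anything.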
This relates to #117
When crawling using Heritrix, if both `sendIfModifiedSince` and `writeRevisitForNotModified` are set to `true` (although the latter has been deprecated, presumably equivalent to always being `true`), a server may respond with an empty response and a WARC record like the following can be written (taken from the warc-specification project):
Here the `WARC-Payload-Digest` has been calculated on the empty, zero-length content. As a result, it won't match that of the earlier record and OpenWayback won't find the original payload.
The WARC spec. does say that:
Whether this means that the `WARC-Payload-Digest` should be calculated on the revisited record, I'm not sure. However, the above is a live, written WARC, so we should probably figure out how to handle such things.