Deduplication & "Not Modified" WARC Records #224

Open
PsypherPunk opened this issue Mar 11, 2015 · 5 comments

@PsypherPunk
Contributor

When crawling with Heritrix, if both sendIfModifiedSince and writeRevisitForNotModified are set to true (the latter has been deprecated, presumably because it is now always treated as true), a server may answer with an empty "Not Modified" response. Heritrix can then write a WARC record like the following (taken from the warc-specification project):

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.bl.uk/
WARC-Date: 2014-11-24T08:13:54Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 91.194.151.38
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/server-not-modified
WARC-Truncated: length
WARC-Etag: "4078134-aed6-6117a140"
WARC-Record-ID: <urn:uuid:d41c9044-fad4-402a-bdc8-ff6c63d0f419>
Content-Length: 0

Here the WARC-Payload-Digest has been calculated on the empty, zero-length content. As a result, it won't match that of the earlier record and OpenWayback won't find the original payload.
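To make the mismatch concrete, here is a minimal sketch that reproduces the digest in the record above from zero bytes of input. It assumes the `sha1:` plus Base32 form shown in the sample record; the helper name `warc_sha1_digest` is made up for illustration and is not part of Heritrix or OpenWayback.

```python
import base64
import hashlib


def warc_sha1_digest(payload: bytes) -> str:
    """Return a digest in the 'sha1:<base32>' form used by the record above.

    Hypothetical helper for illustration only.
    """
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


# Digest of the empty, zero-length content written for the revisit record.
EMPTY_PAYLOAD_DIGEST = warc_sha1_digest(b"")
print(EMPTY_PAYLOAD_DIGEST)  # sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```

Any earlier capture with a non-empty payload will carry a different value, which is why a lookup by digest finds nothing.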

The WARC spec. does say that:

For records using this profile, the payload is defined as the original payload content from which a 'LastModified' and/or 'ETag' value was taken.

Whether this means that the WARC-Payload-Digest should be calculated on the payload of the record being revisited, rather than on the empty response, I'm not sure. However, the above is a live, written WARC, so we should probably figure out how to handle such records.

@kris-sigur
Member

This is a Heritrix bug.

From the WARC spec, chapter 5.9 on WARC-Payload-Digest:

An optional parameter indicating the algorithm name and calculated value of a digest applied to the
payload referred to or contained by the record - which is not necessarily equivalent to the record block

(emphasis is mine)

This clearly means that in server-not-modified revisit records, this field should either be omitted or be equal to the payload digest of the original record.

@kris-sigur
Member

I do wonder whether OpenWayback can gracefully handle the absence of the digest. Presumably it should, if the original URL and date are provided?

@PsypherPunk
Contributor Author

If they are then yes, I think that should work. Worth building in a test case anyway.

In the case of the above I'm wondering whether we should attempt to handle this despite the fact it's non-compliant? Alternatively, we could give an example of a way to work around it outside OpenWayback (I'm thinking of a script to create 'dummy' CDX lines for revisits with no matching response).

@kris-sigur
Member

We could probably detect digests of the empty payload (these should always have the same value for a given algorithm) and process the record as if there wasn't any digest.

In the absence of original URI and/or date, that would mean using the latest "previous" capture that isn't a revisit.
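A rough sketch of that fallback, under stated assumptions: the `Capture` type, its fields, and both function names are hypothetical and do not reflect OpenWayback's actual classes; timestamps are assumed to be in one fixed, lexicographically sortable format (such as WARC-Date), and only the SHA-1 empty digest from the record above is recognised.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# SHA-1 (Base32) of zero bytes, as seen in the revisit record above.
EMPTY_SHA1 = "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"


@dataclass(frozen=True)
class Capture:
    """Hypothetical index entry; not an OpenWayback type."""
    uri: str
    timestamp: str            # e.g. "2014-11-24T08:13:54Z", sortable as text
    is_revisit: bool
    digest: Optional[str] = None


def effective_digest(capture: Capture) -> Optional[str]:
    """Treat an empty-payload digest on a revisit as if no digest was given."""
    if capture.is_revisit and capture.digest == EMPTY_SHA1:
        return None
    return capture.digest


def resolve_revisit(revisit: Capture, captures: Sequence[Capture]) -> Optional[Capture]:
    """Find the capture whose payload a revisit should replay."""
    # Only earlier, non-revisit captures of the same URI can hold the payload.
    candidates = [
        c for c in captures
        if not c.is_revisit
        and c.uri == revisit.uri
        and c.timestamp < revisit.timestamp
    ]
    digest = effective_digest(revisit)
    if digest is not None:
        # A usable digest narrows the search to matching payloads.
        candidates = [c for c in candidates if c.digest == digest]
    # Otherwise fall back to the latest previous capture that isn't a revisit.
    return max(candidates, key=lambda c: c.timestamp, default=None)
```

If the revisit record does supply the original URI and date, those would take precedence over this fallback; that lookup is left out here to keep the sketch short.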

@kris-sigur
Member

This relates to #117


2 participants