Make MetaData multi-valued to preserve values of repeating WARC and HTTP headers #98

Merged
merged 2 commits into from
Nov 27, 2024

sebastian-nagel
Contributor

MetaData objects, which hold (among other things) the headers of WARC records and HTTP captures, should be multi-valued so that the values of a repeated header are stored as a list.
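To illustrate the idea, here is a minimal sketch in Python of a multi-valued header map. It is not the code of this PR: the function name `add_header`, the plain-dict representation, and the choice to keep a single value as a string and switch to a list only on repetition are all assumptions made for illustration.

```python
from typing import Dict, List, Union

# Hypothetical representation: a single value stays a string,
# a repeated header becomes a list of strings in order of appearance.
HeaderValue = Union[str, List[str]]


def add_header(meta: Dict[str, HeaderValue], name: str, value: str) -> None:
    """Add a header value without overwriting earlier values of the same name."""
    if name not in meta:
        meta[name] = value                # first occurrence: plain string
    elif isinstance(meta[name], list):
        meta[name].append(value)          # third or later occurrence
    else:
        meta[name] = [meta[name], value]  # second occurrence: promote to list


meta: Dict[str, HeaderValue] = {}
for name, value in [
    ("Set-Cookie", "a=1"),
    ("Content-Type", "text/html"),
    ("Set-Cookie", "b=2"),
]:
    add_header(meta, name, value)
```

In this sketch header names are compared as given; a real implementation would also have to decide how to treat names that differ only in case.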

The core objective is to get repeated WARC and HTTP headers extracted into WAT files, see also commoncrawl#18. The WAT specification does not say anything about repeated headers, and the given examples do not include any repeated header.

Apart from the ubiquitous "Set-Cookie" header, more and more HTTP headers occur repeatedly in HTTP responses. As an example, the number of WARC response records (out of 31498) from a single Common Crawl WARC file in which an HTTP header was repeated:

8356    set-cookie
4959    link
2022    server-timing
1321    vary
 983    x-powered-by
 592    cache-control
 361    x-frame-options
 285    x-content-type-options
 246    strict-transport-security
 155    x-xss-protection
  88    content-security-policy
  84    referrer-policy
  42    simplycom-server
  37    server
  31    x-permitted-cross-domain-policies
  28    pragma
 ...
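A count of this kind could be produced roughly as follows. This is only a sketch of the counting logic, not the script actually used: the function name `count_records_with_repeats` is hypothetical, the sample records are made up, and the headers are assumed to be already parsed into `(name, value)` pairs per response record.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Headers = List[Tuple[str, str]]


def count_records_with_repeats(records: Iterable[Headers]) -> Counter:
    """For each header name, count the records in which it occurs more than once."""
    repeated: Counter = Counter()
    for headers in records:
        # Header names are case-insensitive, so normalize before counting.
        per_record = Counter(name.lower() for name, _ in headers)
        # Each record contributes at most 1 per header name.
        repeated.update(name for name, n in per_record.items() if n > 1)
    return repeated


# Made-up sample data, three response records.
records = [
    [("Set-Cookie", "a=1"), ("set-cookie", "b=2"), ("Vary", "Accept")],
    [("Link", "<x>; rel=preload"), ("Link", "<y>; rel=preload"), ("Set-Cookie", "c=3")],
    [("Set-Cookie", "d=4"), ("Set-Cookie", "e=5"), ("Set-Cookie", "f=6")],
]
counts = count_records_with_repeats(records)
for name, n in counts.most_common():
    print(f"{n:4d}    {name}")
```

Note that a record counts once per header name, regardless of how many times the header repeats within it.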

See also the WARC response record included in this PR, which is used as a test resource.

In addition, some proposed WARC headers are allowed (or even intended) to occur multiple times, e.g. iipc/warc-specifications#42.

…TTP headers

- code cleanup: fix indentation, remove unneeded return statements
@ato ato merged commit dcbb052 into iipc:master Nov 27, 2024
5 checks passed
@ato
Member

ato commented Nov 29, 2024

Thanks. Released as 1.2.0.
