Make MetaData multi-valued to preserve values of repeating WARC and HTTP headers #98
MetaData objects, which hold (among other things) the headers of WARC records and HTTP captures, should be multi-valued so that the values of repeated headers are stored as a list.
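To illustrate the idea, here is a minimal sketch of a multi-valued header container. It is not the MetaData class from this PR: the class name `MultiValuedHeaders` and its methods `add`, `get` and `isRepeated` are hypothetical and only show how repeated values can be appended to a list instead of overwriting each other. The sketch also assumes that header names should be matched case-insensitively.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a multi-valued header container (not the project's MetaData class). */
public class MultiValuedHeaders {
    // Assumption: header names are compared case-insensitively, as in HTTP.
    private final Map<String, List<String>> values = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    /** Appends a value instead of overwriting an earlier value stored under the same name. */
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns all values in the order they were added, or an empty list if the header is absent. */
    public List<String> get(String name) {
        List<String> v = values.get(name);
        return v == null ? Collections.emptyList() : Collections.unmodifiableList(v);
    }

    /** True if the header occurred more than once. */
    public boolean isRepeated(String name) {
        return get(name).size() > 1;
    }

    public static void main(String[] args) {
        MultiValuedHeaders headers = new MultiValuedHeaders();
        headers.add("Set-Cookie", "a=1");
        headers.add("Content-Type", "text/html");
        headers.add("set-cookie", "b=2");
        System.out.println(headers.get("Set-Cookie"));   // [a=1, b=2]
        System.out.println(headers.get("Content-Type")); // [text/html]
    }
}
```

With a single-valued map, the second "Set-Cookie" value would replace the first; here both survive, and a header that occurs only once is simply a list with one element.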
The core objective is to have repeated WARC and HTTP headers extracted into WAT files; see also commoncrawl#18. The WAT specification says nothing about repeated headers, and the given examples do not include any repeated header.
Apart from the ubiquitous "Set-Cookie" header, more and more HTTP headers occur multiple times within a single HTTP response. As an example, the number of WARC response records (out of 31498) from a single Common Crawl WARC file in which an HTTP header was repeated:
See also the WARC response record included in this PR and used as a test resource.
In addition, proposed WARC headers are allowed (or desired) to occur multiple times, e.g. iipc/warc-specifications#42.