Store request info in WARC #1

Open
jnioche opened this issue Apr 20, 2016 · 4 comments

@jnioche
Member

jnioche commented Apr 20, 2016

We currently deal with the response only.

@jnioche
Member Author

jnioche commented Jul 20, 2016

@sebastian-nagel assigning to you but I can do it when I am back from hols

@jnioche
Member Author

jnioche commented Jul 21, 2016

I couldn't find a way of accessing the complete request from httpclient, so we'll have to make do with the response only.

@jnioche
Member Author

jnioche commented Jul 21, 2016

Instead, we can store the HTTP header information in the warcinfo record:

    String userAgent = AbstractHttpProtocol.getAgentString(getConf());
    fields.put("http-header-user-agent", userAgent);
    fields.put("http-header-from",
            ConfUtils.getString(getConf(), "http.agent.email"));

    byte[] warcinfo = WARCRecordFormat.generateWARCInfo(fields);

see [https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/#warcinfo]
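For illustration, here is a rough sketch of what a warcinfo record carrying these two fields could look like when serialised, assuming the WARC 1.0 layout (header lines, a blank line, an `application/warc-fields` block, then two CRLFs). The class `WarcInfoSketch`, the method `buildWarcInfo` and the sample agent and email values are hypothetical names made up for this example. This is not the actual `WARCRecordFormat.generateWARCInfo` implementation, which may differ.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper, for illustration only.
final class WarcInfoSketch {

    private static final String CRLF = "\r\n";

    /** Wraps the given fields in a warcinfo record (assumed WARC 1.0 layout). */
    static byte[] buildWarcInfo(Map<String, String> fields, String warcDate) {
        // The record block: one "name: value" line per field.
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (field.getValue() == null) {
                continue; // skip values that are not configured
            }
            body.append(field.getKey()).append(": ")
                .append(field.getValue()).append(CRLF);
        }
        byte[] payload = body.toString().getBytes(StandardCharsets.UTF_8);

        // The record header; Content-Length counts the bytes of the block only.
        String header = "WARC/1.0" + CRLF
                + "WARC-Type: warcinfo" + CRLF
                + "WARC-Date: " + warcDate + CRLF
                + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">" + CRLF
                + "Content-Type: application/warc-fields" + CRLF
                + "Content-Length: " + payload.length + CRLF
                + CRLF;
        byte[] headerBytes = header.getBytes(StandardCharsets.UTF_8);
        // Every record is terminated by two CRLFs.
        byte[] trailer = (CRLF + CRLF).getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headerBytes, 0, headerBytes.length);
        out.write(payload, 0, payload.length);
        out.write(trailer, 0, trailer.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Sample values, not taken from any real configuration.
        fields.put("http-header-user-agent", "mycrawler/1.0");
        fields.put("http-header-from", "crawler@example.org");

        byte[] record = buildWarcInfo(fields, "2016-07-21T00:00:00Z");
        System.out.print(new String(record, StandardCharsets.UTF_8));
    }
}
```

Since the user agent and contact address are the same for every fetch, writing them once per WARC file in this way avoids repeating them in a request record for each URL.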

@sebastian-nagel
Collaborator

Indeed, information in request records is mostly redundant:

  • host, IP address, requested URI and date are also contained in the response record
  • user agent info to be added to warcinfo
  • the accept parameters (Accept, Accept-Language, Accept-Encoding) are also constant for all requests

We need to check that WARC tools and libraries do not choke on files without request records.
