warc extractor
Warc-extractor.py is a tool for filtering and extracting files from warc archive files. The script serves three purposes:
- Provide basic information as to what a collection of warc files contain.
- Create new warc files containing only filtered elements of old warc files.
- Dump the file contents of a warc file to disk.
python3 warc-extractor.py
Running the program without any arguments scans all of the warc files in the current directory and outputs some basic information about those files.
Warc-extractor.py accepts an unlimited number of filter options. A filter option controls which warc entries the script scans.
python3 warc-extractor.py warc-type:request
In the above example the script will output basic information about all of the warc entries where the warc header 'warc-type' is set to request (case insensitive). Substrings are allowed in the second part so 'warc-type:requ' would be equivalent while 'warc-type:re' would return both 'request' and 'response' entries.
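The matching rule described above (case-insensitive, substring allowed) can be sketched in a few lines of Python. This is only an illustration, not the script's actual code: the function name `matches_filter` and the use of a plain dict for the warc headers are assumptions.

```python
def matches_filter(headers, name, value):
    """Return True if header `name` exists and contains `value`.

    Hypothetical helper: both the header name and the value are
    compared case-insensitively, and `value` may be a substring.
    """
    lowered = {key.lower(): val.lower() for key, val in headers.items()}
    if name.lower() not in lowered:
        return False
    return value.lower() in lowered[name.lower()]


request = {"WARC-Type": "request"}
response = {"WARC-Type": "response"}

print(matches_filter(request, "warc-type", "requ"))   # True
print(matches_filter(response, "warc-type", "requ"))  # False
print(matches_filter(response, "warc-type", "re"))    # True
```

Under this rule 'warc-type:re' matches both entries because "re" is a substring of both "request" and "response".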
Many warc entries also contain HTTP headers which can also be accessed by filter.
python3 warc-extractor.py http:content-type:pdf
The above script finds all warc entries that contain PDFs. Specifically, it filters out any warc entry that does not contain an HTTP header 'content-type' containing the string 'pdf'. (Note: supplying any HTTP filter implicitly filters out any warc entry that does not contain an HTTP request or response.)
There is also some information found in the first line of an HTTP object (the request line or status line). This information can be accessed via some special operators: error, command, path, status, version. The most important of these is error.
python3 warc-extractor.py http:error:200
The above script would filter out any HTTP responses that did not return error code 200, as well as implicitly remove HTTP requests which do not contain error codes.
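To see why requests have no error code, it helps to look at the first line of an HTTP object. The sketch below splits that line into its parts. It is an illustration only: `parse_first_line` and the dictionary keys are hypothetical names, and how the script maps these parts onto its own operators is not shown here.

```python
def parse_first_line(line):
    """Split the first line of an HTTP request or response.

    A response starts with the version ("HTTP/1.1 200 OK"), while a
    request starts with the command ("GET /index.html HTTP/1.1").
    """
    parts = line.strip().split(" ", 2)
    parts += [""] * (3 - len(parts))  # pad short or malformed lines
    if parts[0].upper().startswith("HTTP/"):
        return {"kind": "response", "version": parts[0],
                "code": parts[1], "reason": parts[2]}
    return {"kind": "request", "command": parts[0],
            "path": parts[1], "version": parts[2]}


print(parse_first_line("HTTP/1.1 200 OK")["code"])
print(parse_first_line("GET /index.html HTTP/1.1")["command"])
```

Only the response form carries a numeric code, so a filter on that code can never match a request.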
Additionally, negative searches are also allowed.
python3 warc-extractor.py \!http:content-type:pdf
The above script would return all warc entries that do not contain PDFs. (Note: the '\' character is required because '!' is a reserved character in bash.)
Once you have verified that the script is grabbing only the warc entries you need, the contents of those entries can be dumped in two different ways.
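Bash consumes the escaping backslash, so the program itself receives the filter as `!http:content-type:pdf`. A filter string of this shape could be broken apart roughly as follows. This is a minimal sketch under stated assumptions: `Filter` and `parse_filter` are hypothetical names, not taken from the script.

```python
from typing import NamedTuple


class Filter(NamedTuple):
    negated: bool  # True when the filter started with "!"
    http: bool     # True when the filter targets an HTTP header
    name: str      # header name, lowercased
    value: str     # text to search for, lowercased


def parse_filter(text):
    """Parse '[!][http:]name:value' into a Filter tuple."""
    negated = text.startswith("!")
    if negated:
        text = text[1:]
    http = text.lower().startswith("http:")
    if http:
        text = text[len("http:"):]
    # Split on the first colon only, so values may contain colons.
    name, _, value = text.partition(":")
    return Filter(negated, http, name.lower(), value.lower())


print(parse_filter("!http:content-type:pdf"))
print(parse_filter("warc-type:request"))
```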
python3 warc-extractor.py some:filter -dump warc
The above script would create a new warc file containing only the filtered elements.
python3 warc-extractor.py some:filter -dump content
The above script would attempt to extract the contents of the filtered entries. (Note: the -dump flag implicitly adds "warc-type:response" and "content-type:application/http" to the filters, because warc entries that do not match these filters do not contain file-like objects.)
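A response entry of type application/http stores the raw HTTP response: a status line and headers, a blank line, then the file itself. The core of content extraction can therefore be sketched as a split at the first blank line. This is an assumption-laden illustration (`split_http_payload` is a hypothetical name), and it ignores complications such as chunked transfer encoding or compressed bodies.

```python
def split_http_payload(payload):
    """Split raw HTTP response bytes into (header block, body).

    The header block ends at the first empty line (CRLF CRLF).
    If no blank line is found, the body is returned as empty bytes.
    """
    head, _, body = payload.partition(b"\r\n\r\n")
    return head, body


raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: application/pdf\r\n"
       b"\r\n"
       b"%PDF-1.4 example bytes")

head, body = split_http_payload(raw)
print(head.split(b"\r\n")[0])
print(body)
```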
-
-h
- Outputs the command line help screen.
- example: python3 warc-extractor.py -h
-
-string
- Limits which .warc files the extractor looks in.
- example: python3 warc-extractor.py -string archive
- (Will only look in .warc files that contain the string "archive" in their filenames.)
-
-path
- Changes which folder the extractor looks in for .warc files.
- example: python3 warc-extractor.py -path /path/to/folder
- (Looks in folder /path/to/folder to find warc files.)
-
-output_path
- Changes the folder dumped files are placed in.
- example: python3 warc-extractor.py -output_path /path/to/folder
- (All dumped files will be placed in /path/to/folder)
-
-output
- Changes the name of the warc file the extractor outputs to.
- example: python3 warc-extractor.py -output new-warc.warc
- (new-warc.warc will be created instead of output.warc)
-
-dump
- Triggers output of data. Defaults to no output.
- Choices are 'content' and 'warc'.
- 'warc' will output all warc entries that remain after filter to 'output.warc'.
- 'content' will output the saved file in all warc entries that remain after filter.
- example: python3 warc-extractor.py -dump content
-
-silence
- Boolean flag. Silences collection of index data and prevents the script from writing to the terminal.
-
-error
- Debugging command, see troubleshooting below.
To create a warc file containing all HTTP responses that are not file-like objects.
python3 warc-extractor.py -dump warc warc-type:response \!content-type:application/http
To dump all PDFs from a warc file to disk.
python3 warc-extractor.py -dump content http:content-type:pdf
To dump everything a warc file contains to disk (that is, the contents of every response that returned code 200).
python3 warc-extractor.py -dump content http:error:200
Warc files are complicated and huge. Creating a single script that can properly handle all of the many strange and wonderful objects that might be hidden in a warc file is a large undertaking. Because of this, bugs are inevitable.
The script contains a -error option designed to make dealing with problematic warc entries a bit easier. If the -error flag is supplied, the script will do its best to skip all entries that cause errors and then write all problematic entries to a new warc file, 'error.warc'. Should this script fail, please try running it again with the -error flag and then upload the resulting 'error.warc' file along with the bug report.
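The skip-and-collect behaviour described above follows a common pattern, sketched here in isolation. This is not the script's implementation: `process_entries` and `handler` are hypothetical names, and writing the failed entries out to 'error.warc' is left out.

```python
def process_entries(entries, handler):
    """Apply `handler` to each entry, setting aside those that fail.

    Returns (results, failed): the handler output for every entry that
    succeeded, and the original entries that raised an exception.
    """
    results, failed = [], []
    for entry in entries:
        try:
            results.append(handler(entry))
        except Exception:
            # Keep the problematic entry so it can be saved for a bug report.
            failed.append(entry)
    return results, failed


results, failed = process_entries(["1", "x", "3"], int)
print(results)  # [1, 3]
print(failed)   # ['x']
```

A pattern like this only helps when the failure is tied to an individual entry; a file that cannot be read at all never reaches the loop.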
There are many possible problems a warc file could contain that are not limited to specific entries. In these situations the -error tag will not prevent the error and will not create the error.warc file. In these cases please still fill out a bug report. However, the problem is unlikely to be fixed unless I can get access to the warc file that created the problem.
One final note, this script was programmed and tested on a Linux platform. In theory it should work on any platform that Python 3 works on; however, I make no guarantees. Help on this issue would be greatly appreciated.