Contributing
All pull requests will be appreciated! There are lots of areas for contributing: user guide, documentation, bug reports, unit tests (we have very few), translation of old Portuguese code, bug fixes...
Before you start working, please check whether an issue already exists and whether someone else is already working on it. If no issue exists, create one; in either case, leave a comment saying that you are going to start working on it.
Use 4 spaces indentation in your PRs.
As for new features, the most needed and most challenging one is certainly parsing (decoding) new forensic artifacts or file formats, and keeping up to date with new versions of already supported artifacts.
To support a new artifact, first you need to detect it. You should add a new mimetype definition to the conf/CustomSignatures.xml file. It can be based on a known file signature or, if none exists, on the file name or extension. For example, let's define a new non-standard mimetype named 'application/x-new-mimetype':
<mime-type type="application/x-new-mimetype">
<magic priority="50">
<match value="SIGNATURE" type="string" offset="0"/>
</magic>
<glob pattern="*.newext"/>
</mime-type>
Basically, it will search for the 'SIGNATURE' string at offset zero of analyzed files. If found, the 'contentType' of the file will be set to 'application/x-new-mimetype'. If not found AND the file does not match any other defined signature (from the Tika library or from CustomSignatures.xml), the *.newext extension will be tested. If it matches, 'contentType' will also be set to 'application/x-new-mimetype'.
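The precedence described above (signature first, extension only as a fallback) can be modeled in a few lines of plain Java. This is only a simplified sketch of the rule, not how Tika actually implements detection: the class name, the method name and the small signature table are hypothetical, and '%PDF-' merely stands in for "all other defined signatures".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of the detection order; not Tika's real code.
class DetectionOrderSketch {

    static final String NEW_TYPE = "application/x-new-mimetype";
    static final String UNKNOWN = "application/octet-stream";

    // Signatures expected at offset zero, standing in for every defined magic value.
    static final Map<String, String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("SIGNATURE", NEW_TYPE);
        SIGNATURES.put("%PDF-", "application/pdf");
    }

    static String detect(byte[] header, String fileName) {
        // 1. Signatures win: compare the first bytes of the file with each magic value.
        for (Map.Entry<String, String> e : SIGNATURES.entrySet()) {
            byte[] magic = e.getKey().getBytes(StandardCharsets.US_ASCII);
            if (header.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(header, magic.length), magic)) {
                return e.getValue();
            }
        }
        // 2. Only when no signature matched is the glob pattern (*.newext) tested.
        if (fileName.endsWith(".newext")) {
            return NEW_TYPE;
        }
        // 3. Nothing matched: fall back to the generic binary type.
        return UNKNOWN;
    }
}
```

Note that, in this model, a file named report.newext that starts with '%PDF-' is still reported as a PDF, because another signature matched before the extension was considered.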
After the new mimetype is defined, you can add all files identified as 'application/x-new-mimetype' to a new category. For that, simply add a new entry to the conf/CategoriesByTypeConfig.txt file:
New Category = application/x-new-mimetype
If you are lucky and know a command line tool that already decodes the new artifact, you can configure IPED to automatically run that tool and import its output in the conf/ExternalParsers.xml file. See details in the User Manual.
If not, you will need to create a new Java parser extending the org.apache.tika.parser.AbstractParser class (which implements the org.apache.tika.parser.Parser interface) and install your new parser in conf/ParserConfig.xml. Parsers should be immutable, so they will be thread-safe. AbstractParser has 2 abstract methods to implement. The first one:
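@Override
public Set<MediaType> getSupportedTypes(ParseContext context) {
    // The method must return a Set, so wrap the single type (needs java.util.Collections)
    return Collections.singleton(MediaType.parse("application/x-new-mimetype"));
}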
That means all files identified as 'application/x-new-mimetype' will be processed by the new parser. The second method:
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
throws IOException, SAXException, TikaException
must do the hard work of file decoding. The parameters are detailed below:
- stream: the file content InputStream that will be processed; be careful not to hold the whole file content in memory if it is large;
- metadata: it provides metadata information about the file, such as original filename, filepath, contentType and filesize. You can also use it to store internal metadata extracted from the file during parsing, such as the number of tables if it is a database, or exif info if it is an image;
- handler: it will collect the SAX events produced by the parser. It is commonly used to collect text extracted from the file, which will be indexed or searched for regex patterns by another module. Instead of plain text, you can also output formatted html, which can be rendered as an html report of the artifact by the html viewer;
- context: it provides context information about the parsing, like parser configuration options.
Please check more details at the Tika site.
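The advice about not holding the whole file in memory usually comes down to reading the stream through a small, reused buffer. The following is a minimal sketch using only the JDK; the class and method names are hypothetical and are not part of IPED or Tika, and counting newline-terminated records is just a placeholder for real decoding logic.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper showing a bounded-memory read loop; not part of IPED or Tika.
class ChunkedReadSketch {

    /** Counts newline-terminated records using a single reused 8 KiB buffer. */
    static long countRecords(InputStream stream) throws IOException {
        byte[] buffer = new byte[8192];
        long records = 0;
        int read;
        // read() returns -1 at end of stream; every pass overwrites the same buffer,
        // so memory use does not grow with the size of the file.
        while ((read = stream.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                if (buffer[i] == '\n') {
                    records++;
                }
            }
        }
        return records;
    }
}
```

Inside a real parse method, a value computed this way could be stored in the metadata object, and extracted text could be sent to the handler chunk by chunk instead of being accumulated in one large string.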
A parser can also extract subitems from container formats like zip, mbox or sqlite. Many file formats can contain both subitems and text to be indexed, like EML, PPT, PDF. For that reason, the parser is also responsible for extracting subitems, so the artifact is decoded only once. To extract a subitem from an artifact, first you need to get the subitem extractor from the context:
EmbeddedDocumentExtractor extractor = context.get(EmbeddedDocumentExtractor.class,
new ParsingEmbeddedDocumentExtractor(context));
If the file type is configured to be expanded, IPED will give you an extractor that creates a new item in the case for each subitem and automatically sends it to the processing queue when you call:
extractor.parseEmbedded(subitemInputStream, handler, subitemMetadata, isHtml);
- subitemInputStream: it is the subitem content that you need to extract and send to the extractor, be careful to not hold large contents in memory;
- handler: it is the same handler used by the parent item;
- subitemMetadata: it is the subitem metadata that you need to fill in with information provided by the parent container, like internal subitem path, name and dates;
- isHtml: flag to enable html tags to be sent to the handler, normally set to true;
If the file type is not configured to be expanded, the default extractor (ParsingEmbeddedDocumentExtractor) will be used. It simply parses the subitems, using the right parser, and concatenates their text to the parent container text.
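The overall shape of a container parser is a loop that walks the subitems and hands each one to the extractor as a stream. The sketch below shows that loop for a zip container using only the JDK. The SubitemSink interface is a hypothetical stand-in for the extractor; in a real parser its accept call would be replaced by extractor.parseEmbedded(subitemInputStream, handler, subitemMetadata, isHtml), with subitemMetadata filled in from the entry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch: SubitemSink stands in for the embedded document extractor.
class SubitemWalkSketch {

    /** Stand-in for the extractor: receives one subitem path and its content stream. */
    interface SubitemSink {
        void accept(String internalPath, InputStream content) throws IOException;
    }

    /** Walks a zip container and passes each file entry on without buffering it first. */
    static int extractAll(InputStream container, SubitemSink sink) throws IOException {
        // The container stream is owned by the caller, so it is not closed here.
        ZipInputStream zip = new ZipInputStream(container);
        int count = 0;
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (!entry.isDirectory()) {
                // The zip stream is positioned on the entry data and reports end of
                // stream at the end of the entry, so it can be passed on directly.
                sink.accept(entry.getName(), zip);
                count++;
            }
            // Skips whatever the sink left unread before moving to the next entry.
            zip.closeEntry();
        }
        return count;
    }
}
```

Passing the entry stream straight through keeps memory use independent of subitem size, which is the same concern raised for the parent stream.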
Take a look at the iped-parsers module: it contains lots of parsers that can be used as examples. You can start by looking at MboxParser (a container parser) and LNKParser (a parser that extracts text).