Home

muc3

This is the public-domain MUC-3 (and MUC-4) dataset from NIST Information Extraction, converted to HTML.

The raw HTML data provides a starting point for text analytics. Publishing it as a GitHub project allows experiments in knowledge representation and reasoning: for example, branching the repository and capturing the results of text processing as semantic mark-up in the original document. This is an attractive idea for a number of reasons:

It means that information extracted from text is captured in the context of the source text - making it easier to check the facts.
The results of information extraction move with the data. If the HTML is moved, so is the analysis.
Processing can be iterative. A text analysis process can take advantage of mark-up created by some earlier process. These processes can be separated in time and space.
The mark-up can be edited manually. Human intervention can be interposed in a chain of information extraction processes.
Semantic mark-up supports semantic indexing of text and makes it easier to extract structured data from text.
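
As a rough illustration of the idea, the sketch below wraps a known mention in a `<span>` carrying its entity type, then reads the annotations back out as structured data. The `entity` class, the `data-type` attribute, the `Location` label, the function and class names, and the sample sentence are all assumptions made up for this example; the project does not prescribe a particular mark-up vocabulary.

```python
import re
from html import escape
from html.parser import HTMLParser


def annotate(html_text, mentions):
    """Wrap each mention in a <span> that records its entity type.

    `mentions` maps a surface string to a (hypothetical) type label.
    Naive sketch: assumes mentions never occur inside tags or attribute
    values, and that no mention is a substring of another.
    """
    for surface, entity_type in mentions.items():
        span = '<span class="entity" data-type="{}">{}</span>'.format(
            escape(entity_type, quote=True), surface
        )
        # A function replacement avoids backslash processing in `span`.
        html_text = re.sub(
            r"\b" + re.escape(surface) + r"\b", lambda m: span, html_text
        )
    return html_text


class EntityCollector(HTMLParser):
    """Collect (type, text) pairs from the spans written by annotate()."""

    def __init__(self):
        super().__init__()
        self.entities = []
        self._current_type = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "entity":
            self._current_type = attrs.get("data-type")
            self._buffer = []

    def handle_data(self, data):
        if self._current_type is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current_type is not None:
            self.entities.append((self._current_type, "".join(self._buffer)))
            self._current_type = None


# Made-up sample sentence, not taken from the dataset.
source = "<p>The attack took place in San Salvador on 12 March.</p>"
marked_up = annotate(source, {"San Salvador": "Location"})

collector = EntityCollector()
collector.feed(marked_up)
print(marked_up)
print(collector.entities)
```

Because the annotations live in the document itself, a later process (or a human editor) can work from `marked_up` rather than from the raw text, and the extracted pairs can always be traced back to the sentence they came from.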

This project is about data and knowledge representation. See Baleen for a framework and toolkit capable of performing the information extraction functions that will be discussed here.
