knoxa edited this page Jun 2, 2016 · 3 revisions

muc3

This is the public domain MUC-3 (and also MUC-4) dataset from NIST Information Extraction converted to HTML.

The raw HTML data provides a starting point for text analytics. Publishing it as a GitHub project allows experiments in knowledge representation and reasoning. For example, one can branch the repository and capture the results of text processing as semantic mark-up in the original document. This is an attractive idea for a number of reasons:

  • It means that information extracted from text is captured in the context of the source text - making it easier to check the facts.
  • The results of information extraction move with the data. If the HTML is moved, so is the analysis.
  • Processing can be iterative. A text analysis process can take advantage of mark-up created by some earlier process. These processes can be separated in time and space.
  • The mark-up can be manually edited. Human intervention can be interposed in a chain of information extraction processes.
  • Semantic mark-up supports semantic indexing of text and makes it easier to extract structured data from text.
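
As a rough illustration of the points above, the following Python sketch shows one way an extraction result could be written into the HTML and read back later by a separate process. The mark-up convention (a `span` with `class="entity"` and a `data-type` attribute), the function names `annotate` and `extract_entities`, and the sample sentence are all assumptions made for this example; they are not a convention defined by this project or by Baleen.

```python
from html import escape
from html.parser import HTMLParser


def annotate(html_text, phrase, entity_type):
    """Wrap the first occurrence of `phrase` in a span carrying its type.

    Hypothetical convention: <span class="entity" data-type="...">.
    This is a naive string replacement; it does not guard against the
    phrase also appearing inside a tag or an attribute value.
    """
    marked = '<span class="entity" data-type="{}">{}</span>'.format(
        escape(entity_type, quote=True), phrase
    )
    return html_text.replace(phrase, marked, 1)


class EntityExtractor(HTMLParser):
    """Collect (type, text) pairs from spans that follow the convention."""

    def __init__(self):
        super().__init__()
        self.entities = []
        self._stack = []  # one entry per open <span>; None if not an entity

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            attributes = dict(attrs)
            if attributes.get("class") == "entity":
                self._stack.append([attributes.get("data-type"), []])
            else:
                self._stack.append(None)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            entry = self._stack.pop()
            if entry is not None:
                self.entities.append((entry[0], "".join(entry[1])))

    def handle_data(self, data):
        # Text belongs to every entity span that is currently open.
        for entry in self._stack:
            if entry is not None:
                entry[1].append(data)


def extract_entities(html_text):
    parser = EntityExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.entities


# Invented sample text, not taken from the MUC corpus.
doc = "<p>A bomb exploded in San Salvador on Tuesday.</p>"
first_pass = annotate(doc, "San Salvador", "LOCATION")  # one process
second_pass = annotate(first_pass, "Tuesday", "DATE")   # a later process
print(extract_entities(second_pass))
```

Because each pass only adds mark-up around the source text, the second pass (or a human editor) can build on the first, and the structured data can be recovered from the document at any point.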

This project is about data and knowledge representation. See Baleen for a framework and toolkit capable of performing the information extraction functions that will be discussed here.
