Home
knoxa edited this page Jun 2, 2016
·
3 revisions
This is the public-domain MUC-3 (and MUC-4) dataset from NIST Information Extraction, converted to HTML.
The raw HTML data provides a starting point for text analytics. Publishing it as a GitHub project allows experiments in knowledge representation and reasoning. For example, you can branch the repository and capture the results of text processing as semantic mark-up in the original document. This is an attractive idea for a number of reasons:
- Information extracted from text is captured in the context of the source text, which makes it easier to check the facts.
- The results of information extraction move with the data. If the HTML is moved, so is the analysis.
- Processing can be iterative. A text analysis process can take advantage of mark-up created by some earlier process. These processes can be separated in time and space.
- The mark-up can be manually edited. Human intervention can be interposed in a chain of information extraction processes.
- Semantic mark-up supports semantic indexing of text and makes it easier to extract structured data from text.
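The points above can be illustrated with a minimal Python sketch of two processes that communicate only through the mark-up. Everything in it is an assumption made for illustration: the `class="entity"` and `data-type` attributes are a hypothetical convention rather than this project's schema, the gazetteer is a toy, and the sample sentence is invented rather than taken from the dataset. The first pass also uses a plain regular expression, which would not be safe on HTML that already carries attributes or annotations.

```python
import re
from html.parser import HTMLParser

# Hypothetical gazetteer: names to look for and the type to record for each.
GAZETTEER = {"San Salvador": "Location", "FMLN": "Organization"}


def annotate(html_text):
    """First process: wrap known names in <span> elements inside the source HTML.

    Simplification: a regex over raw HTML is only safe for simple documents
    that have not been annotated yet.
    """
    names = sorted(GAZETTEER, key=len, reverse=True)  # prefer longer names
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def wrap(match):
        name = match.group(0)
        return '<span class="entity" data-type="%s">%s</span>' % (GAZETTEER[name], name)

    return pattern.sub(wrap, html_text)


class EntityCollector(HTMLParser):
    """Later process: read the mark-up left by an earlier process."""

    def __init__(self):
        super().__init__()
        self.entities = []   # (type, text) pairs in document order
        self._type = None    # type of the span being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and attributes.get("class") == "entity":
            self._type = attributes.get("data-type")
            self._text = []

    def handle_data(self, data):
        if self._type is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._type is not None:
            self.entities.append((self._type, "".join(self._text)))
            self._type = None


def extract(html_text):
    """Return the structured data carried by the semantic mark-up."""
    collector = EntityCollector()
    collector.feed(html_text)
    collector.close()
    return collector.entities


if __name__ == "__main__":
    # Invented sample text, not a document from the dataset.
    marked_up = annotate("<p>FMLN representatives met in San Salvador.</p>")
    print(marked_up)
    print(extract(marked_up))
```

Because the second process reads only the HTML, it does not matter whether the spans were written by `annotate`, by a different tool, or by a person editing the file by hand, and the two steps can run at different times on different machines.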
This project is about data and knowledge representation. See Baleen for a framework and toolkit capable of performing the information extraction functions that will be discussed here.