
Extracted Information

knoxa edited this page Feb 25, 2019 · 22 revisions

Events

These are schema.org Actions:

Each event is a sentence extracted from the corpus. The URI is the document URL concatenated with the offset of the start of the sentence in the text. The text of the sentence itself is the description property. The name property is an event category from the GTD Codebook. An rdfs:member property links each event to its containing document.
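The construction of an event record can be sketched as below. This is only an illustration: the "#" separator between URL and offset, the direction of the rdfs:member link, the dictionary layout, and the example URL, sentence and category are all assumptions rather than details taken from the extraction code.

```python
def sentence_uri(document_url, offset, separator="#"):
    """Join a document URL and a sentence start offset into one URI.

    The separator is an assumption; the page only says the two parts
    are concatenated.
    """
    return f"{document_url}{separator}{offset}"


def make_event(document_url, offset, sentence, category):
    """Describe one extracted sentence as a schema.org Action."""
    return {
        "@id": sentence_uri(document_url, offset),
        "@type": "http://schema.org/Action",
        "name": category,             # event category (GTD Codebook)
        "description": sentence,      # the sentence text itself
        "rdfs:member": document_url,  # link to the containing document (direction assumed)
    }


# Hypothetical document URL, offset and sentence, for illustration only.
event = make_event(
    "http://example.org/docs/0001.txt",
    120,
    "A bomb exploded outside the ministry last night.",
    "Bombing/Explosion",
)
print(event["@id"])
```

Because the URI depends only on the document URL and the sentence offset, any other process that sees the same sentence produces the same identifier.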

Organizations

These are schema.org Organizations, with rdfs:member properties linking each to the documents in which the organization is mentioned.

Dates and Times

  • Dates represented as the startTime property of a schema.org Action. The URI for each Action is constructed from the document URL and sentence offset as above. These dates should therefore automatically link to the events extracted above if they come from the same sentence.

  • Datetimes as rdfs:member properties on the sentences that mention them (which in turn are linked by rdfs:member to the documents containing them). The sentence URIs are constructed from document URL and offset (as for events and dates above).
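The automatic linking of dates to events amounts to merging records that share a URI. A minimal sketch, assuming each extraction process emits a list of dictionaries keyed by "@id" (the record layout, the fuse helper and the example values are hypothetical):

```python
def fuse(*record_sets):
    """Merge records from several extraction runs on their shared URI."""
    fused = {}
    for records in record_sets:
        for record in records:
            merged = fused.setdefault(record["@id"], {})
            for key, value in record.items():
                if key not in merged:
                    merged[key] = value
                elif merged[key] != value:
                    # Keep conflicting values side by side rather than overwrite.
                    existing = merged[key] if isinstance(merged[key], list) else [merged[key]]
                    if value not in existing:
                        existing.append(value)
                    merged[key] = existing
    return fused


# Output of two independent processes run over the same (made-up) sentence.
events = [{"@id": "http://example.org/docs/0001.txt#120",
           "@type": "Action", "name": "Bombing/Explosion"}]
dates = [{"@id": "http://example.org/docs/0001.txt#120",
          "@type": "Action", "startTime": "1989-11-12"}]

fused = fuse(events, dates)
print(fused["http://example.org/docs/0001.txt#120"])
```

The merged record carries both the event category and the start time, because both processes named the sentence with the same URI.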

People

Entity Disambiguation

The named entity lists include mentions of people and names of people. We can use these lists to construct a directed acyclic graph of mention, name and token nodes, with mention nodes linked to a name node for each name they contain (a person mention may include aliases and nicknames), and name nodes linked to a token node for each part of the name. See mentions.csv for a graph constructed in this way. It is an edge list, where node labels are text spans from the corpus (forced to lower case) and the node types are 0 (a mention span), 1 (a name span) or 2 (a token span). A Dover version of this graph (called PEOPLE) can be found in the gh-pages branch of this project. This graph relates mentions through common names and name tokens, and can therefore be used for entity disambiguation.
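The graph construction described above can be sketched as follows. The input layout (a mapping from each mention span to the name spans it contains), whitespace tokenisation, and the example names are assumptions made for illustration; the actual column layout of mentions.csv is not reproduced here.

```python
MENTION, NAME, TOKEN = 0, 1, 2  # node types as described for mentions.csv


def build_edges(mentions):
    """Return a sorted edge list of ((type, label), (type, label)) pairs.

    `mentions` maps a mention span to the list of name spans it contains.
    Labels are forced to lower case; names are split on whitespace.
    """
    edges = set()
    for mention, names in mentions.items():
        mention_node = (MENTION, mention.lower())
        for name in names:
            name_node = (NAME, name.lower())
            edges.add((mention_node, name_node))
            for token in name.lower().split():
                edges.add((name_node, (TOKEN, token)))
    return sorted(edges)


# Made-up mentions: the first contains a name and an alias.
example = {
    "Gen. Juan Garcia, alias El Tigre": ["Juan Garcia", "El Tigre"],
    "Garcia": ["Garcia"],
}
for source, target in build_edges(example):
    print(source, "->", target)
```

Pairing each label with its node type keeps the mention "garcia", the name "garcia" and the token "garcia" apart as three distinct nodes.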

One possible approach is to construct a similarity measure, $S = AA^T + wA^2D^{-1}(A^T)^2$, from the adjacency matrix $A$ of the graph. This gives the bibliographic coupling between mentions through common names (where $A_{ij}$ in Dover is the number of links from $i$ to $j$), plus a weighted contribution from the bibliographic coupling through common name tokens (weighted by $w$, and inversely proportional to the token degrees - so common name tokens contribute less). $S_{ij}$ puts a number on the similarity between mentions $i$ and $j$ (when both nodes are type 0). This will be nonzero if any part of a name (i.e. any single name token) matches, and further analysis is required to distinguish similar names of different individuals (family members, say).

One way to do this is to examine the sub-graph induced by the two mentions and any names and tokens found by following links out from those mentions. This sub-graph should NOT have a sub-graph isomorphic to the pattern: 2 <-- 1 --> 2 <-- 1 --> 2 (i.e. two name nodes that share a token node, each also linked to one other token node) where the degree of the two token nodes at either end of this pattern is one. This is equivalent to saying that it's OK for two names for the same person to be different if one is a subset of the other (in terms of name tokens), but not OK if both have a token not seen in the other. This "W" pattern also catches the case where the difference between tokens is caused by a spelling mistake (e.g. "febe" and "fede"). Some further test on the labels of the two degree 1 token nodes (e.g. Levenshtein distance) can be applied to decide whether to permit or deny a match in these instances. See similar.csv for the results of an experiment along these lines. Each record of this file asserts an equivalence between person mentions.
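A small worked sketch of the similarity measure and of the subset rule, in plain Python. It is not the Dover implementation: it assumes D is the diagonal matrix of total node degree, represents nodes as hypothetical (type, label) pairs, and applies the "W" test to the token sets of two names rather than to an induced sub-graph.

```python
def matmul(X, Y):
    """Multiply two matrices held as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def similarity(edges, w=1.0):
    """Compute S = A A^T + w A^2 D^-1 (A^T)^2 for a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for source, target in edges:
        A[index[source]][index[target]] += 1  # A[i][j] = number of links i -> j
    # Assumption: D holds total degree (in + out) of each node.
    degree = [sum(A[i]) + sum(row[i] for row in A) for i in range(n)]
    A2 = matmul(A, A)  # mention -> token paths of length two
    A2_scaled = [[A2[i][k] / degree[k] if degree[k] else 0.0 for k in range(n)]
                 for i in range(n)]
    by_name = matmul(A, transpose(A))             # A A^T
    by_token = matmul(A2_scaled, transpose(A2))   # A^2 D^-1 (A^2)^T
    S = [[by_name[i][j] + w * by_token[i][j] for j in range(n)] for i in range(n)]
    return S, index


def names_conflict(tokens_a, tokens_b):
    """True when two names share a token but each also has one the other lacks."""
    a, b = set(tokens_a), set(tokens_b)
    return bool(a & b) and bool(a - b) and bool(b - a)


# Made-up graph: two mentions with no common name but one common token.
edges = [
    ((0, "gen. juan garcia"), (1, "juan garcia")),
    ((1, "juan garcia"), (2, "juan")),
    ((1, "juan garcia"), (2, "garcia")),
    ((0, "garcia"), (1, "garcia")),
    ((1, "garcia"), (2, "garcia")),
]
S, index = similarity(edges, w=1.0)
print(S[index[(0, "gen. juan garcia")]][index[(0, "garcia")]])  # 0.5
print(names_conflict(["juan", "garcia"], ["garcia"]))            # False
print(names_conflict(["juan", "garcia"], ["pedro", "garcia"]))   # True
```

Here the two mentions share no name, so the first term is zero; they share the token "garcia", which has degree 2, so the second term contributes 1/2. The subset rule then allows "juan garcia" to match "garcia" but rejects "juan garcia" against "pedro garcia".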

Entity Linking

A number of people mentioned in the corpus can be found in Wikipedia, and hence DBpedia. Extracting RDF from DBpedia creates MUC-3 personality linked data that can be used as a reference dataset for entity linking.

Wikipedia/DBpedia allow multiple names for a person through the use of "redirect" pages, but Wikipedia page names must be unique. The upshot is that there is no ambiguity in the names in the personality data (which are the skos:prefLabel and skos:altLabel RDF properties). Constructing a dictionary from these names, and searching the mentions of people list, gives the mapping of mention to DBpedia URI found in identity.csv.
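The dictionary lookup can be sketched as below. The input layout, the whole-word matching rule, the preference for the longest matching label, and the example.org URIs and names are assumptions for illustration; they are not taken from the process that produced identity.csv.

```python
import re


def build_dictionary(people):
    """Map every lower-cased label (preferred or alternative) to its URI.

    `people` maps a URI to its list of labels; labels are assumed unambiguous.
    """
    lookup = {}
    for uri, labels in people.items():
        for label in labels:
            lookup[label.lower()] = uri
    return lookup


def link_mentions(mentions, lookup):
    """Return {mention: uri} for mentions that contain a known label."""
    links = {}
    labels = sorted(lookup, key=len, reverse=True)  # try longest labels first
    for mention in mentions:
        text = mention.lower()
        for label in labels:
            # Whole-word match, so "ana" would not match inside "banana".
            if re.search(r"\b" + re.escape(label) + r"\b", text):
                links[mention] = lookup[label]
                break
    return links


# Placeholder reference data and mentions, not real DBpedia resources.
people = {
    "http://example.org/resource/Ana_Lopez": ["Ana Lopez", "Ana Maria Lopez"],
    "http://example.org/resource/Luis_Rivas": ["Luis Rivas"],
}
mentions = ["president ana lopez", "col. luis rivas", "an unidentified man"]
links = link_mentions(mentions, build_dictionary(people))
print(links)
```

Mentions with no matching label are simply left out of the mapping.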

To do...

It is desirable that information extracted by different processes at different times should be capable of being fused. This is achieved above (for events and dates) by making a sentence an Action in both cases, then fusing on the URI of the action (sentence). Is there a better way? Should there be an ontology for the document text (which would include a "sentence" type)?
