gangeli/ParsingTime

Parsing Time: Learning to Interpret Time Expressions

----------
Abstract
----------
We present a probabilistic approach for learning to interpret temporal 
phrases given only a corpus of utterances and the times they reference. 
While most approaches to the task have used regular expressions and similar 
linear pattern interpretation rules, the possibility of phrasal embedding 
and modification in time expressions motivates our use of a compositional 
grammar of time expressions. This grammar is used to construct a latent 
parse which evaluates to the time the phrase would represent, as a logical 
parse might evaluate to a concrete entity. In this way, we can employ a 
loosely supervised EM-style bootstrapping approach to learn these latent 
parses while capturing both syntactic uncertainty and pragmatic ambiguity 
in a probabilistic framework. We achieve an accuracy of 72% on an adapted 
TempEval-2 task -- comparable to state-of-the-art systems.

----------
Compilation
----------
The code was built against:
  Scala 2.9.1
  JavaNLP (April 9 2012 version)

Most experiments can be run using one of the following entries in the rc file:
  interpret: run the interpretation model (last valid run)
  interpretAux: run the interpretation model (good parameters)
  sutimeInterpret: run SUTime
  heideltimeInterpret: run HeidelTime
  gutimeInterpret: run GUTime

  detect: run the detection model (last valid run)

Models
  dist/time.jar: the program
  dist/interpretModel.ser.gz: serialized temporal interpretation model
  dist/detectModel.ser.gz: serialized temporal detection model (includes
                           interpretation model)
  dist/log.gz: the log from the run producing interpretModel.ser.gz
  dist/options: the command line options from the run
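
As a rough illustration of how a .ser.gz file could be read, the sketch below
assumes the models are gzip-compressed Java-serialized objects. This README
does not confirm that format, and SerSketch, save, and load are hypothetical
names, not part of the repository.

```scala
import java.io.{FileInputStream, FileOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical helper: assumes a .ser.gz file is a gzipped,
// Java-serialized object (an assumption, not confirmed by the README).
object SerSketch {
  // Write any serializable object to a gzip-compressed file.
  def save(obj: java.io.Serializable, path: String): Unit = {
    val out = new ObjectOutputStream(
      new GZIPOutputStream(new FileOutputStream(path)))
    try out.writeObject(obj) finally out.close()
  }

  // Read the object back; the caller supplies the expected type.
  def load[T](path: String): T = {
    val in = new ObjectInputStream(
      new GZIPInputStream(new FileInputStream(path)))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Loading a real model this way would also require the classes in
dist/time.jar on the classpath.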

----------
Tour
----------
Some key classes are explained below. This is not an exhaustive list, but
  it should give a sense of the code.

Data.scala: Utilities for working with TempEval data, or the CoreMaps created
            from it
  class TimeDataset: holds the CoreMaps corresponding to the raw data
  object DataLib:    utilities for working with the data

Detect.scala: Code for the detection model
  class CRFFeatures: the definition of the features used
  class CRFDetector: the class which performs and aggregates the detections
  class DetectionDatum: encapsulates the input information for detecting a
    time during training
  class CoreMapDetectionStore: the dataset used for detection; handles
    iterating over the data
  class DetectionTask: trains the CRF detection module

Entry.scala: General purpose code and main methods
  class Indexing: handles moving between strings and ints for POS and words
  object G: general utility fields
  object U: general utility functions
  class TimeSent: represents a sentence sent to the CKY parser
  trait DataStore: interface for a block of data (train/dev/test)
  *object Interpret: MAIN for running interpretation or detection for
    either our system or another system (via a CoreNLP Annotator)
  *object Entry: MAIN for training the models; this is the usual entry
    point for the program, when not evaluating
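
To illustrate the kind of string-to-int mapping that Indexing is described as
handling, here is a minimal sketch. IndexingSketch and Indexer, along with
their methods, are hypothetical names; the repository's actual class may be
structured differently.

```scala
import scala.collection.mutable

// Hypothetical sketch of a bidirectional string <-> int index.
object IndexingSketch {
  class Indexer {
    private val toInt = mutable.HashMap[String, Int]()
    private val toStr = mutable.ArrayBuffer[String]()

    // Return the id for s, assigning the next free id on first sight.
    def index(s: String): Int = toInt.get(s) match {
      case Some(i) => i
      case None =>
        val i = toStr.length
        toStr += s
        toInt(s) = i
        i
    }

    // Recover the string for a previously assigned id.
    def lookup(i: Int): String = toStr(i)

    def size: Int = toStr.length
  }
}
```

One indexer per vocabulary (for example, one for words and one for POS tags)
keeps the two id spaces separate.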

Gigaword.scala: Some old code to try to get more training instances from
                automatically parsed data

Interpret.scala: Most of the code for the interpretation model lives here
  class Grammar: defines the grammar used in CKY parsing
  object Grammar: creates the actual temporal grammar
  *class TreeTime: MODEL -- this is our model, which can be serialized and
    loaded. It is also a CoreNLP annotator, and does both detection and
    interpretation.
  class InterpretationTask: the code for training the interpretation model
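
The idea of a parse that evaluates to a time can be sketched with a toy
expression tree. Everything here (ParseSketch, DayRange, Today, Fixed,
ShiftBy, and the use of whole days as the unit) is a hypothetical
simplification for illustration, not the grammar defined in Interpret.scala.

```scala
// Hypothetical toy: a parse tree whose nodes evaluate to a range of days.
object ParseSketch {
  // A half-open range [begin, end), measured in whole days (an assumption).
  case class DayRange(begin: Long, end: Long)

  sealed trait Node
  // A floating range: only meaningful relative to a reference day.
  case object Today extends Node
  // A range that is already grounded.
  case class Fixed(range: DayRange) extends Node
  // Moves the child's range forward by a number of days.
  case class ShiftBy(child: Node, days: Long) extends Node

  // Evaluate the tree bottom-up, grounding floating nodes on referenceDay.
  def eval(node: Node, referenceDay: Long): DayRange = node match {
    case Today => DayRange(referenceDay, referenceDay + 1)
    case Fixed(r) => r
    case ShiftBy(child, days) =>
      val r = eval(child, referenceDay)
      DayRange(r.begin + days, r.end + days)
  }
}
```

For example, a phrase like "a week from today" could correspond to
ShiftBy(Today, 7), which only yields a concrete range once a reference day
is supplied.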

O.java: All the options for training the model(s). Some of these may be dead
        (no longer used).

Tempeval2.scala: Code for processing the TempEval data into CoreMaps

Temporal.scala: The temporal representation
  trait Temporal: all temporal representations subclass this
  trait Range: all ranges subclass this (see paper)
  trait Duration: all durations subclass this (see paper)
  trait Sequence: all sequences subclass this (see paper). Note that a
    Sequence is also both a Range and a Duration
  class Time: a point in time

  class GroundedRange: the conventional begin-to-end range (e.g., 1998)
  class UngroundedRange: a 'floating' range (e.g., today)
  class GroundedDuration: the conventional x-milliseconds type of duration
  class FuzzyDuration: an approximate duration
  class RepeatedRange: the standard range-every-n-milliseconds type of sequence
  class CompositeRange: a complex sequence that we can't evaluate analytically
  class NoTime: returned when an invalid operation is performed and no such
    time actually exists (e.g., Feb 30)
  class UnkTime: a time that we can't represent

  object Range: utility functions for handling ranges, notably intersecting
  *object Time: MAIN for running an interactive shell for evaluating temporal
    expressions. This is actually kind of a nifty domain-specific language.
  object Lex: the definitions of times used in the grammar
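
The relationships among these types can be sketched roughly as follows. The
trait and class names mirror the list above, but the fields, the ground
method, and the intersect function are assumptions made for illustration;
Temporal.scala itself will differ.

```scala
// Hypothetical, simplified sketch of the type hierarchy described above.
object TemporalSketch {
  sealed trait Temporal
  trait Range extends Temporal
  trait Duration extends Temporal
  // A sequence is both a Range and a Duration.
  trait Sequence extends Range with Duration

  // A point in time, stored as milliseconds (an assumption).
  case class Time(millis: Long)

  // A conventional begin-to-end range.
  case class GroundedRange(begin: Time, end: Time) extends Range

  // A floating range, defined relative to a reference time.
  case class UngroundedRange(offsetMillis: Long, lengthMillis: Long)
      extends Range {
    def ground(reference: Time): GroundedRange = {
      val begin = reference.millis + offsetMillis
      GroundedRange(Time(begin), Time(begin + lengthMillis))
    }
  }

  case class GroundedDuration(millis: Long) extends Duration

  // Stands in for the result of an operation with no valid answer.
  case object NoTime extends Temporal

  // Intersect two grounded ranges; disjoint ranges yield NoTime.
  def intersect(a: GroundedRange, b: GroundedRange): Temporal = {
    val begin = math.max(a.begin.millis, b.begin.millis)
    val end = math.min(a.end.millis, b.end.millis)
    if (begin < end) GroundedRange(Time(begin), Time(end)) else NoTime
  }
}
```

Returning NoTime rather than throwing lets an invalid combination flow
through later operations as an ordinary value.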

----------
Notes
----------

  - The classes in etc/ are licensed under various licenses -- they're included
    here to make compilation easier, but care should be taken when using this
    code in any commercial setting to respect their respective licenses.