Significance
We use randomised permutation and bootstrap methods to test whether the difference between a pair of systems is statistically significant. When many systems are compared, testing every pair can take a long time to run. We also provide a tool to calculate confidence intervals at given percentiles. Overlapping confidence intervals may indicate that system performances do not differ significantly.
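As an illustration of the paired test, the following is a minimal sketch of an approximate randomisation (permutation) test on micro-averaged F1. It is not this tool's implementation. The names `micro_f1` and `permutation_test`, and the representation of each document as a `(tp, fp, fn)` tuple, are assumptions made for the example.

```python
import random


def micro_f1(doc_counts):
    """Micro-averaged F1 from per-document (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in doc_counts)
    fp = sum(c[1] for c in doc_counts)
    fn = sum(c[2] for c in doc_counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def permutation_test(sys_a, sys_b, trials=1000, seed=0):
    """Two-sided p-value for the F1 difference between two systems.

    sys_a and sys_b hold counts for the same documents in the same order
    (hypothetical input format). In each trial, every document's pair of
    outputs is swapped between the systems with probability 0.5.
    """
    if len(sys_a) != len(sys_b):
        raise ValueError("systems must be scored on the same documents")
    rng = random.Random(seed)
    observed = abs(micro_f1(sys_a) - micro_f1(sys_b))
    extreme = 0
    for _ in range(trials):
        perm_a, perm_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        if abs(micro_f1(perm_a) - micro_f1(perm_b)) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value above zero.
    return (extreme + 1) / (trials + 1)
```

Each trial rescans every document, so the cost is roughly trials × documents per pair of systems, and the number of pairs grows quadratically with the number of systems.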
Note that bootstrap resampling is performed over the documents of a single system run. This assumes that the predictions on each document are made independently, which is certainly untrue for nil clustering, and may be untrue for linking approaches that exploit cross-document clustering.
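The document-level resampling described above might look like the following sketch of a percentile bootstrap confidence interval. It is an illustration under stated assumptions, not this tool's code. The function names, the `(tp, fp, fn)` per-document format and the nearest-rank percentile rule are all choices made for the example.

```python
import random


def micro_f1(doc_counts):
    """Micro-averaged F1 from per-document (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in doc_counts)
    fp = sum(c[1] for c in doc_counts)
    fn = sum(c[2] for c in doc_counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(doc_counts, percentiles=(2.5, 97.5), trials=1000, seed=0):
    """Percentile bootstrap interval for micro F1 of one system run.

    Documents are the resampling unit: each trial draws len(doc_counts)
    documents with replacement and rescores the resample.
    """
    rng = random.Random(seed)
    n = len(doc_counts)
    scores = sorted(
        micro_f1([doc_counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(trials)
    )
    # Nearest-rank lookup, clamped to the last index.
    return [scores[min(trials - 1, int(p / 100 * trials))] for p in percentiles]
```

Because whole documents are drawn, any dependence between documents (such as nil clusters spanning several documents) is ignored, so the interval may be narrower than it should be.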
Davison & Hinkley (1997). Bootstrap methods and their application. Cambridge University Press.
Lin (2004). Looking for a few good metrics: ROUGE and its evaluation. In NTCIR.
Noreen (1989). Computer-intensive methods for testing hypotheses. Wiley-Interscience.
Tjong Kim Sang & De Meulder (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.