This repository has been archived by the owner on Nov 6, 2018. It is now read-only.

A manually curated gold standard for quote extraction in literary texts

mahlberg-lab/clic-gold-standard

clic-gold-standard

This is a manually curated gold standard for quote extraction in literary texts.

It contains randomly selected paragraphs from 15 novels by Charles Dickens and 29 non-Dickensian 19th-century novels.

The files are XML. Quotes are marked with the empty milestone elements <qs/>, <qe/>, <alt-qs/>, and <alt-qe/>, short for "quote start", "quote end", "alternative quote start", and "alternative quote end" respectively.
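As a rough illustration of how milestone-delimited quotes could be read back out, here is a minimal Python sketch. It assumes each file is well-formed XML and that the text of a quote sits between a start milestone and the next end milestone. The `extract_quotes` helper and the sample paragraph (including its `<p>` wrapper) are hypothetical and are not taken from the repository.

```python
import xml.etree.ElementTree as ET

# Milestone names as described in this README.
START_TAGS = {"qs", "alt-qs"}
END_TAGS = {"qe", "alt-qe"}


def extract_quotes(xml_text):
    """Return the text spans enclosed by quote milestones, in document order."""
    root = ET.fromstring(xml_text)
    quotes = []
    buffer = []
    inside = False

    def walk(elem):
        nonlocal inside, buffer
        if elem.tag in START_TAGS:
            # A start milestone opens a new quote.
            inside, buffer = True, []
        elif elem.tag in END_TAGS:
            # An end milestone closes the current quote, if one is open.
            if inside:
                quotes.append("".join(buffer))
            inside = False
        elif inside and elem.text:
            # Text of an ordinary element nested inside a quote.
            buffer.append(elem.text)
        for child in elem:
            walk(child)
            # Text following a child (its tail) belongs to the parent's content.
            if inside and child.tail:
                buffer.append(child.tail)

    walk(root)
    return quotes


# Hypothetical sample paragraph; the real files may be structured differently.
sample = (
    "<p><qs/>\"Come in,\"<qe/> said the man. "
    "<alt-qs/>'Who is it?'<alt-qe/></p>"
)
print(extract_quotes(sample))
```

The sketch treats normal and alternative milestones alike; keeping them apart would only require recording which start tag opened each span.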

Remarks

The distinction between normal quote tags (<qs/>, <qe/>) and alternative quote tags (<alt-qs/>, <alt-qe/>) has not been manually verified, so a normal quote may be mistakenly tagged as an alternative quote. This does not affect precision and recall if the goal is to measure whether quotes are retrieved, regardless of whether they are alternative.
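The following Python sketch shows one way such a type-agnostic evaluation could be set up. It assumes quotes are compared as exact (start, end) character spans, which is only one possible matching criterion and is not prescribed by this gold standard. The `precision_recall` function and all offsets are hypothetical.

```python
def precision_recall(gold_spans, predicted_spans):
    """Compute precision and recall over exact-match (start, end) spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations: (tag type, start offset, end offset).
gold_annotations = [
    ("qs", 0, 10),
    ("alt-qs", 25, 40),
    ("qs", 60, 72),
]

# Dropping the tag type merges normal and alternative quotes, so an
# unverified "alt" label cannot change the score.
gold_spans = {(start, end) for _, start, end in gold_annotations}

# Hypothetical output of a quote extractor under evaluation.
predicted_spans = {(0, 10), (25, 40), (80, 90), (100, 110)}

precision, recall = precision_recall(gold_spans, predicted_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict; an evaluation that tolerates boundary differences, such as whether the quotation marks are included, would need an overlap-based criterion instead.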

The gold standard also annotates suspensions between alternative quotes.

Known issues

Recently solved issues

To do

  • add definitions
