EdinburghNLP/scriptbase

The ScriptBase Corpus

If you use this data, please cite either or both of the following papers:

If you are using ScriptBase-alpha:

Philip John Gorinski and Mirella Lapata (2015). Movie Script Summarization as Graph-based Scene Extraction. In Proceedings of NAACL-HLT 2015, Denver, Colorado, USA.

If you are using ScriptBase-J:

Philip John Gorinski and Mirella Lapata (2018). What's this Movie about? A Joint Neural Network Architecture for Movie Content Analysis. In Proceedings of NAACL-HLT 2018, New Orleans, Louisiana, USA.

The ScriptBase Corpus is split into two parts:

ScriptBase-alpha: The first crawl of movie scripts

ScriptBase-J: Additional metadata from Jinni

ScriptBase-alpha can be found in the scriptbase_alpha folder.

It contains .tar.gz archives with the following data for 1,276 movies:

  • script.htm / script.html - the original HTML script, present only where the script was crawled from a web page in HTML format
  • script.txt - plain-text version of the movie script
  • wiki.html - the movie's Wikipedia[1] page (2014 dump)
  • imdb.html - the movie's main IMDB[2] page (2014 dump)
  • keywords.html - the movie's IMDB keywords page
  • credits.html - the movie's IMDB credits page
  • summary.html - the movie's IMDB summaries page
  • synopsis.html - the movie's IMDB synopsis page
  • taglines.html - the movie's IMDB taglines page
  • processed/imdb_meta - metadata extracted from IMDB
  • processed/logTag.txt - the movie's log line(s) and tag line(s), if it has any
  • processed/wikiplot.txt - plain-text version of Wikipedia's plot section for the movie
  • processed/summaries/ - folder containing plain-text versions of the movie's IMDB summaries (if any)
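
The files listed above can be read directly from a .tar.gz archive without unpacking it to disk. The following is a minimal sketch, not part of the corpus tooling: the function name `read_member` is hypothetical, and because the directory layout inside each archive is not described here, the code matches a file by its trailing path components rather than by a full path. The text encoding is also an assumption (UTF-8, with undecodable bytes replaced).

```python
import tarfile
from pathlib import PurePosixPath
from typing import Optional


def read_member(archive_path: str, wanted: str, encoding: str = "utf-8") -> Optional[str]:
    """Return the text of the first file in the archive whose path ends with `wanted`.

    `wanted` is a relative name from the list above, e.g. "script.txt" or
    "processed/wikiplot.txt". Returns None if no such file is found.
    """
    wanted_parts = PurePosixPath(wanted).parts
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            parts = PurePosixPath(member.name).parts
            # Compare whole path components, so "script.txt" does not
            # accidentally match a file such as "other_script.txt".
            if len(parts) >= len(wanted_parts) and parts[-len(wanted_parts):] == wanted_parts:
                handle = tar.extractfile(member)
                if handle is None:
                    continue
                # Assumption: files are UTF-8; bad bytes are replaced, not fatal.
                return handle.read().decode(encoding, errors="replace")
    return None
```

For example, `read_member("some_archive.tar.gz", "script.txt")` would return the plain-text script, where `some_archive.tar.gz` stands in for any archive from the scriptbase_alpha folder.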

ScriptBase-J can be found in the scriptbase_jinni folder.

It contains .tar.gz archives with the following additional data for 917 movies:

  • jinni.html - the movie's Jinni[3] page (2015 dump)
  • processed/script_clean.txt - plain-text version of the movie script, manually corrected for inconsistencies
  • processed/script.xml - XML version of the movie script, with various automatic annotations
  • processed/profile.txt - Jinni's movie profile in plain-text format
  • processed/genes.txt - all Jinni genes (attribute-value pairs) for the movie
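
To check that an archive contains the ScriptBase-J files listed above, a small inventory check may help. This is a sketch under stated assumptions: `missing_files` and `EXPECTED_J_FILES` are hypothetical names, the expected file list is copied from the bullets above, and the archive passed in is assumed to cover a single movie. Files are matched by trailing path components, since the layout inside the archives is not described here.

```python
import tarfile
from pathlib import PurePosixPath
from typing import List, Sequence

# Relative names taken from the ScriptBase-J file list above.
EXPECTED_J_FILES = (
    "jinni.html",
    "processed/script_clean.txt",
    "processed/script.xml",
    "processed/profile.txt",
    "processed/genes.txt",
)


def missing_files(archive_path: str, expected: Sequence[str] = EXPECTED_J_FILES) -> List[str]:
    """Return the expected names that do not appear in the archive, in input order."""
    with tarfile.open(archive_path, "r:gz") as tar:
        member_parts = [
            PurePosixPath(member.name).parts
            for member in tar.getmembers()
            if member.isfile()
        ]

    missing = []
    for name in expected:
        wanted = PurePosixPath(name).parts
        # A name counts as present if any member path ends with its components.
        found = any(
            len(parts) >= len(wanted) and parts[-len(wanted):] == wanted
            for parts in member_parts
        )
        if not found:
            missing.append(name)
    return missing
```

An empty return value means every listed file was found. The same function can be applied to ScriptBase-alpha archives by passing a different `expected` list, keeping in mind that some of those files are optional.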

References

[1] https://en.wikipedia.org/

[2] https://www.imdb.com/

[3] http://www.jinni.com/
