Dataset

This is a dummy directory that previews the directory structure of the dataset. A sample processed JSON object is available here.

Download the dataset here.

Data

The current release of the dataset has documents processed from the following conference proceedings.

NeurIPS   EMNLP   ACL    InterSpeech
-------   -----   ----   -----------
2019      2019    2019   2019
2018      -       2018   2018
2017      -       2017   2017
2016      -       2016   -
2015      -       2015   -

Note: A few files from certain proceedings were dropped from the dataset due to parsing errors.
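To check programmatically whether a given conference and year is part of the release, the table above can be transcribed into a small lookup. This is only a sketch: the names `COVERAGE`, `is_covered`, and `conferences_for` are hypothetical helpers and are not shipped with the dataset.

```python
from __future__ import annotations

# Years covered per conference, transcribed from the table above.
COVERAGE = {
    "NeurIPS": {2015, 2016, 2017, 2018, 2019},
    "EMNLP": {2019},
    "ACL": {2015, 2016, 2017, 2018, 2019},
    "InterSpeech": {2017, 2018, 2019},
}


def is_covered(conference: str, year: int) -> bool:
    """Return True if the release lists proceedings for this conference and year."""
    return year in COVERAGE.get(conference, set())


def conferences_for(year: int) -> list[str]:
    """Return the conferences with proceedings for the given year, sorted by name."""
    return sorted(name for name, years in COVERAGE.items() if year in years)
```

For example, `is_covered("EMNLP", 2018)` is false, while `conferences_for(2016)` lists ACL and NeurIPS. Keep in mind that a covered year may still be missing individual files that failed to parse.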

Features

The dataset contains the following fields extracted from each document.

  • Semantic fields extracted using an NLTK and spaCy pipeline.
    • entities: Named Entity Recognition is performed on the document text to extract text spans as entities and tag each with an entity_type.
    • tags: Part-of-speech-tagged tokens extracted from the document text.
    • parser: Dependency parse relations between text spans of the document.
    • noun_chunks: Base noun phrases that have a noun as their head.
  • Metadata fields extracted using PyPDF2:
    filename        metadata   numPages
    title           author     subject
    creator         producer   keywords
    creationdate    moddate    trapped
    ptexfullbanner  raw_text
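
As a rough illustration of how these fields might be consumed, the sketch below summarizes one processed record. The README does not specify the JSON layout, so the structure here is an assumption: `entities` is taken to be a list of objects carrying an `entity_type` key, `noun_chunks` a list of strings, and the metadata fields top-level keys. The `sample` record and the `summarize_record` helper are invented for illustration; consult the sample JSON object for the actual schema.

```python
import json
from collections import Counter

# Metadata field names as listed in this README.
METADATA_FIELDS = [
    "filename", "metadata", "numPages", "title", "author", "subject",
    "creator", "producer", "keywords", "creationdate", "moddate",
    "trapped", "ptexfullbanner", "raw_text",
]


def summarize_record(record: dict) -> dict:
    """Summarize one processed document (the record layout is assumed, see above)."""
    # ASSUMPTION: each entity is a dict with an "entity_type" key.
    entity_counts = Counter(e["entity_type"] for e in record.get("entities", []))
    # Report which of the documented metadata fields this record carries.
    present = [field for field in METADATA_FIELDS if field in record]
    return {
        "entity_counts": dict(entity_counts),
        "num_noun_chunks": len(record.get("noun_chunks", [])),
        "metadata_present": present,
    }


# Hypothetical record, invented for illustration; not taken from the dataset.
sample = json.loads("""
{
  "filename": "example.pdf",
  "numPages": 9,
  "title": "An Example Paper",
  "raw_text": "BERT was trained at Google.",
  "entities": [
    {"text": "BERT", "entity_type": "ORG"},
    {"text": "Google", "entity_type": "ORG"}
  ],
  "noun_chunks": ["BERT", "Google"]
}
""")

summary = summarize_record(sample)
print(summary["entity_counts"])
```

Using `record.get(...)` with empty defaults keeps the helper tolerant of records where a field is absent, which seems prudent given that some files were dropped or partially parsed.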