This is a dummy directory to preview the directory structure of the dataset. A sample processed JSON object is available here.
Download the dataset here.
The current release of the dataset has documents processed from the following conference proceedings.
NeurIPS | EMNLP | ACL | InterSpeech |
---|---|---|---|
- | - | - | - |
2019 | 2019 | 2019 | 2019 |
2018 | - | 2018 | 2018 |
2017 | - | 2017 | 2017 |
2016 | - | 2016 | - |
2015 | - | 2015 | - |
Note: A few files from certain proceedings were dropped from the dataset due to parsing errors.
The dataset contains the following fields extracted from each document.
- Semantically extracted fields using an NLTK and spaCy pipeline:
  - entities: Named entity recognition is performed on the document text to extract text spans as entities and tag them with entity_type.
  - tags: Part-of-speech tagged tokens extracted from the document text.
  - parser: Dependency parse relations between text spans of the document.
  - noun_chunks: Base noun phrases that have a noun as their head.
- Metadata fields extracted using PyPDF2:
  - filename
  - metadata
    - numPages
    - title
    - author
    - subject
    - creator
    - producer
    - keywords
    - creationdate
    - moddate
    - trapped
    - ptexfullbanner
  - raw_text
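
As a minimal sketch of how one processed record might be consumed, the snippet below parses a JSON object and summarizes the fields listed above. The value layout is an assumption, not the dataset's documented schema: it supposes `entities` is a list of `[text, entity_type]` pairs, `tags` is a list of `[token, pos]` pairs, `noun_chunks` is a list of strings, and `metadata` is a flat dictionary that holds `numPages`, `title`, and the other PDF properties. The `SAMPLE` record and the `summarize_record` helper are hypothetical and should be checked against the sample JSON object before use.

```python
import json
from collections import Counter

# Hypothetical record: the field names come from the list above, but the value
# layout (pairs, flat metadata dict) is an assumption, not the real schema.
SAMPLE = json.dumps({
    "filename": "example.pdf",
    "metadata": {"numPages": 9, "title": "An Example Paper", "author": "A. Author"},
    "raw_text": "Neural networks improve speech recognition.",
    "entities": [["Neural networks", "ORG"]],
    "tags": [["Neural", "ADJ"], ["networks", "NOUN"], ["improve", "VERB"],
             ["speech", "NOUN"], ["recognition", "NOUN"], [".", "PUNCT"]],
    "noun_chunks": ["Neural networks", "speech recognition"],
})


def summarize_record(record_json: str) -> dict:
    """Return a small summary of one processed document (hypothetical helper)."""
    record = json.loads(record_json)
    metadata = record.get("metadata", {})
    # Count part-of-speech labels; each tag is assumed to be a (token, pos) pair.
    pos_counts = Counter(pos for _token, pos in record.get("tags", []))
    return {
        "filename": record.get("filename"),
        "title": metadata.get("title"),
        "num_pages": metadata.get("numPages"),
        "num_entities": len(record.get("entities", [])),
        "num_noun_chunks": len(record.get("noun_chunks", [])),
        "pos_counts": dict(pos_counts),
    }


summary = summarize_record(SAMPLE)
print(summary["title"], summary["num_pages"], summary["pos_counts"])
```

Using `.get` with defaults keeps the helper tolerant of records where a field is missing, which may matter given that some files were dropped or only partially parsed.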