proceedings/dataset at master · afrozas/proceedings

README.md

This is a dummy directory for previewing the directory structure of the dataset. A sample processed JSON object is available here.

Download the dataset here.

The current release of the dataset has documents processed from the following conference proceedings.

| NeurIPS | EMNLP | ACL | InterSpeech |
| - | - | - | - |
| 2019 | 2019 | 2019 | 2019 |
| 2018 | - | 2018 | 2018 |
| 2017 | - | 2017 | 2017 |
| 2016 | - | 2016 | - |
| 2015 | - | 2015 | - |

Note: A few files from certain proceedings were dropped from the dataset due to parsing errors.

The dataset contains the following fields extracted from each document.

Fields extracted semantically using an NLTK and spaCy pipeline:
- entities: Named Entity Recognition is performed on the document text to extract text spans as entities and tag each with an entity_type.
- tags: Part-of-speech-tagged tokens extracted from the document text.
- parser: Dependency parse relations between text spans of the document.
- noun_chunks: Base noun phrases that have a noun as their head.
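
As a rough illustration of how the four fields above could be produced, here is a minimal sketch that uses spaCy only. The model name `en_core_web_sm`, the helper names `load_pipeline` and `extract_fields`, and the shape of each returned item are assumptions made for this example; the actual pipeline (including its NLTK steps) and the dataset's exact record layout may differ.

```python
from typing import Any, Dict


def load_pipeline(model: str = "en_core_web_sm"):
    """Load a spaCy pipeline. The model name is an assumption, not taken from the dataset."""
    import spacy  # imported lazily so extract_fields can be used without loading a model

    return spacy.load(model)


def extract_fields(doc: Any) -> Dict[str, list]:
    """Collect the four fields from a parsed spaCy Doc (item layout is hypothetical)."""
    return {
        # Named entities: the text span and its entity type label.
        "entities": [{"text": ent.text, "entity_type": ent.label_} for ent in doc.ents],
        # Coarse part-of-speech tag for every token.
        "tags": [{"text": tok.text, "pos": tok.pos_} for tok in doc],
        # Dependency relation linking each token to its syntactic head.
        "parser": [
            {"text": tok.text, "dep": tok.dep_, "head": tok.head.text} for tok in doc
        ],
        # Base noun phrases.
        "noun_chunks": [chunk.text for chunk in doc.noun_chunks],
    }
```

With spaCy and the model installed, `extract_fields(load_pipeline()(raw_text))` would return a dictionary keyed by the field names listed above.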

Metadata fields extracted using PyPDF2: filename, metadata, numPages, title, author, subject, creator, producer, keywords, creationdate, moddate, trapped, ptexfullbanner, raw_text.
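
The lowercase field names resemble PDF document-information keys such as `/CreationDate` and `/PTEX.Fullbanner`. The sketch below shows one way such fields could be read with PyPDF2 (2.0 or later). The key-normalization scheme and the helper names `normalize_metadata` and `read_pdf_fields` are assumptions for illustration, not the dataset's confirmed extraction code.

```python
from typing import Any, Dict, Mapping, Optional


def normalize_metadata(info: Optional[Mapping[str, Any]]) -> Dict[str, Any]:
    """Map keys such as '/CreationDate' to 'creationdate' (assumed naming scheme)."""
    if not info:
        return {}  # PDFs without an info dictionary yield no metadata
    return {
        key.lstrip("/").replace(".", "").lower(): value for key, value in info.items()
    }


def read_pdf_fields(path: str) -> Dict[str, Any]:
    """Read metadata, page count, and raw text from one PDF file."""
    from PyPDF2 import PdfReader  # lazy import: only needed when a real PDF is read

    reader = PdfReader(path)
    fields = normalize_metadata(reader.metadata)
    fields["filename"] = path
    fields["numPages"] = len(reader.pages)
    # extract_text() may return an empty string for scanned or image-only pages.
    fields["raw_text"] = "\n".join(page.extract_text() or "" for page in reader.pages)
    return fields
```

For example, `normalize_metadata({"/Title": "A Paper", "/ModDate": "D:2019"})` yields a dictionary with the keys `title` and `moddate`.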