GitHub - notnews/nytimes-corpus-extractor: Extract all the fields from the NY Times Corpus to a csv

Extract All the Fields from the New York Times Corpus to a CSV

The New York Times Corpus is a collection of 1.8 million articles published between 1987 and 2007, along with a fair bit of metadata. For more details about the NY Times Corpus, see https://catalog.ldc.upenn.edu/LDC2008T19.

Once you have the NY Times Corpus, unzip it to a folder and then run the script. The script produces a CSV file and a set of text files containing the story text.
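If your copy of the corpus arrives as gzip-compressed tar archives (an assumption; check your own download), the unpacking step could be scripted roughly as follows. This is a minimal sketch, not part of nytextract.py: the function name `extract_archives` and both directory arguments are hypothetical, and the code is written to run on Python 2.7 as well as Python 3.

```python
import os
import tarfile


def extract_archives(src_dir, dest_dir):
    """Extract every .tgz/.tar.gz archive found under src_dir into dest_dir.

    The sub-folder that holds each archive (for example a year) is
    recreated under dest_dir. Returns the list of extracted archive paths.
    """
    extracted = []
    for root, _dirs, files in os.walk(src_dir):
        for name in sorted(files):
            if not name.endswith((".tgz", ".tar.gz")):
                continue  # skip anything that is not a gzip-compressed tar
            archive = os.path.join(root, name)
            target = os.path.join(dest_dir, os.path.relpath(root, src_dir))
            if not os.path.isdir(target):
                os.makedirs(target)
            with tarfile.open(archive, "r:gz") as tar:
                tar.extractall(target)
            extracted.append(archive)
    return extracted
```

Calling `extract_archives("downloads", "corpus")` would leave the unpacked XML files under `corpus`, which can then be passed to the script as the XML directory. Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.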

Requirements

Python 2.x

Installation

To install the dependency lxml 3.1.1:

pip install -r requirements.txt

Usage

python nytextract.py [options] <xml directory>

Options:


  -h, --help            show this help message and exit
  -a, --append          Append if existing (default: False)
  -o OUTFILE, --out=OUTFILE
                        CSV output file (default: outfile.csv)
  -d OUTDIR, --dir=OUTDIR
                        Text output directory (default: text)
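The option list above can be mirrored with Python's standard `optparse` module. The sketch below is an illustration only: it assumes the script parses its arguments in roughly this way, and the function name `build_parser` and the `dest` attribute names (`append`, `outfile`, `outdir`) are hypothetical rather than taken from nytextract.py.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with the same options and defaults as the help text."""
    parser = OptionParser(usage="usage: %prog [options] <xml directory>")
    parser.add_option("-a", "--append", action="store_true", dest="append",
                      default=False,
                      help="Append if existing (default: %default)")
    parser.add_option("-o", "--out", dest="outfile", default="outfile.csv",
                      help="CSV output file (default: %default)")
    parser.add_option("-d", "--dir", dest="outdir", default="text",
                      help="Text output directory (default: %default)")
    return parser
```

Passing a list such as `["-o", "2000.csv", "2000"]` to `build_parser().parse_args(...)` returns the parsed options together with the remaining positional arguments, here the XML directory.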

Example

To process all XML files in the folder 2000 (which holds the files from the year 2000):

python nytextract.py -o 2000.csv -d text 2000

The script will generate a CSV file named "2000.csv". The story text files will be stored in the folder "text", which will have exactly the same structure as the folder "2000".
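To make the mirrored folder structure concrete, here is a small sketch of how an input XML path could be mapped to its output text path. It is an assumption-laden illustration rather than the script's actual code: the helper names `iter_xml_files` and `text_path_for` are hypothetical, and the `.txt` extension for story files is a guess.

```python
import os


def iter_xml_files(xml_root):
    """Yield the path of every .xml file found under xml_root."""
    for root, _dirs, files in os.walk(xml_root):
        for name in sorted(files):
            if name.lower().endswith(".xml"):
                yield os.path.join(root, name)


def text_path_for(xml_path, xml_root, out_dir):
    """Map an XML file under xml_root to the same relative path under out_dir.

    The .txt extension is an assumption made for this sketch.
    """
    rel = os.path.relpath(xml_path, xml_root)
    base, _ext = os.path.splitext(rel)
    return os.path.join(out_dir, base + ".txt")
```

Under these assumptions, an input file at `2000/01/01/example.xml` would map to `text/01/01/example.txt`.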

License

Scripts are released under the MIT License.
