data

Dean Wampler

Added an alternative, text-only version of the stop words.

Jul 7, 2015

65452e6 · Jul 7, 2015

Name	Last commit message	Last commit date
enron-spam-ham	Made data files non-executable, which triggered misinterpretation of …	Jan 27, 2015
hive-kjv	Final commits for the HDP 2.1	Oct 6, 2014
shakespeare	First commit.	May 1, 2014
README.md	Tweaked the data/README.md	Feb 5, 2015
abbrevs-to-names.tsv	First commit.	May 1, 2014
apodat.txt	First commit.	May 1, 2014
gallic.mb.txt	Added Caesar's Gallic Wars.	Feb 5, 2015
kjvdat.txt	First commit.	May 1, 2014
sept.txt	First commit.	May 1, 2014
stop-words.txt	Added an alternative, text-only version of the stop words.	Jul 7, 2015
t3utf.dat	First commit.	May 1, 2014
ugntdat.txt	First commit.	May 1, 2014
vuldat.txt	First commit.	May 1, 2014

README.md

README for the data directory

When you download this example from GitHub, you'll need to copy the data to HDFS using the following command. Note the last argument: the examples in this tutorial assume the data is in the location shown. Change it as you see fit, but you'll then have to pass arguments to the programs to specify the new location:

hadoop fs -put data /user/$USER/data

This step has already been done for prepackaged Hadoop distribution virtual machines.

Hence, the following discussion applies to both data locations.

Sacred Texts

The following ancient, sacred texts are from www.sacred-texts.com/bib/osrc/. All are copyright-free texts, where each verse is on a separate line, prefixed by the book name, chapter number, and verse number, all "|" separated.

File	Description
`kjvdat.txt`	The King James Version of the Bible. For some reason, each line (one per verse) ends with a "~" character.
`t3utf.dat`	Tanach, the Hebrew Bible.
`vuldat.txt`	The Latin Vulgate.
`sept.txt`	The Septuagint (Koine Greek of the Hebrew Old Testament).
`ugntdat.txt`	The Greek New Testament.
`apodat.txt`	The Apocrypha (in English).
`abbrevs-to-names.tsv`	A map from the book abbreviations used in these texts to the full book names. Derived using data from the sacred-texts.com site.

There are many other texts from the world's religious traditions at the www.sacred-texts.com site, but most of the others aren't formatted into one convenient file like these examples.
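
As a rough illustration of the record layout described above, here is a minimal Python sketch that splits one verse line into its four fields and drops the trailing "~" found in `kjvdat.txt`. The `Verse` type, the `parse_verse` helper, and the sample line (including the `Gen` abbreviation) are assumptions made for this example, not part of the data set or the tutorial code.

```python
from typing import NamedTuple


class Verse(NamedTuple):
    book: str      # book abbreviation (the sample below uses an assumed value)
    chapter: int
    verse: int
    text: str


def parse_verse(line: str) -> Verse:
    """Split one 'book|chapter|verse|text' line into typed fields."""
    # maxsplit=3 keeps any "|" that might appear inside the verse text.
    book, chapter, verse, text = line.rstrip("\n").split("|", 3)
    text = text.strip()
    # kjvdat.txt ends each verse with "~"; drop it if present.
    if text.endswith("~"):
        text = text[:-1].rstrip()
    return Verse(book, int(chapter), int(verse), text)


# Illustrative sample line in the described layout, not copied from the file.
sample = "Gen|1|1| In the beginning God created the heaven and the earth.~\n"
print(parse_verse(sample))
```

The same helper should work for the other verse files if they follow the same layout; for those, the "~" check simply does nothing.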

Here are Hive DDL statements for these files, if you want to put them into Hive.

For example, this DDL statement can be used for the data/kjvdat.txt file. It assumes you've copied the file to the directory hdfs://server/user/<USER>/data/kjvdat in HDFS, where server is the server name or IP address of the NameNode. Hive requires a directory path rather than a file name for a table location.

CREATE EXTERNAL TABLE IF NOT EXISTS kjv (
  book    STRING,
  chapter INT,
  verse   INT,
  text    STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 'hdfs://server/user/<USER>/data/kjvdat';

Actually the hdfs://server prefix can be omitted. Hive will infer the correct file system type based on its configuration. The <USER> must be replaced with the actual user name.

The same DDL can be used for the other files mentioned above, except for the name map file, abbrevs-to-names.tsv. Here is a DDL statement for the latter, assuming the file is copied to the directory hdfs://server/user/<USER>/data/abbrevs_to_names in HDFS.

CREATE EXTERNAL TABLE IF NOT EXISTS abbrevs_to_names (
  abbrev  STRING,
  book    STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs://server/user/<USER>/data/abbrevs_to_names';

Note that the field delimiter is tab, not "|".
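
Outside Hive, the same lookup can be sketched in a few lines of Python. This assumes the two-column, tab-separated layout described above; the `load_abbrevs` helper and the sample rows are hypothetical, and the real file's abbreviations may differ.

```python
import csv
import io
from typing import Dict


def load_abbrevs(tsv_text: str) -> Dict[str, str]:
    """Build an abbreviation -> full book name lookup from tab-separated text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # Skip blank or single-column rows rather than failing on them.
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Hypothetical rows in the abbrev<TAB>book layout; the real file may differ.
sample_tsv = "Gen\tGenesis\nExo\tExodus\n"
names = load_abbrevs(sample_tsv)

# Resolve the abbreviation at the start of a "|"-separated verse line.
verse_line = "Exo|20|13|Thou shalt not kill."
abbrev = verse_line.split("|", 1)[0]
full_name = names.get(abbrev, abbrev)  # fall back to the abbreviation if unmapped
print(full_name)
```

In Hive, the equivalent would be a join between the verse table and abbrevs_to_names on the abbreviation column.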

Julius Caesar's "Gallic Wars"

gallic.mb.txt is an English translation of Julius Caesar's famous memoir, Gallic Wars, about his conquest of Gaul (roughly modern France, the French part of Switzerland, and parts of Germany).

Email Classified as SPAM and HAM

A sample of SPAM/HAM classified emails from the well-known Enron email data set was adapted from this research project. Each file is plain text, partially formatted (i.e., with name:value headers) as used in email servers and clients.

Directory	Description
`enron-spam-ham/ham100`	A sample of 100 emails from the dataset that were classified as HAM.
`enron-spam-ham/spam100`	A sample of 100 emails from the dataset that were classified as SPAM.

If you want to load the emails as raw text, one line per "record", use this Hive DDL, which defines two partitions, one for HAM and one for SPAM. We assume you have copied the data/enron-spam-ham directory to HDFS at hdfs://server/user/<USER>/data/enron-spam-ham:

CREATE EXTERNAL TABLE IF NOT EXISTS mail (line STRING)
PARTITIONED BY (is_spam BOOLEAN);

ALTER TABLE mail ADD PARTITION(is_spam = true)
LOCATION 'hdfs://server/user/<USER>/data/enron-spam-ham/spam100';

ALTER TABLE mail ADD PARTITION(is_spam = false)
LOCATION 'hdfs://server/user/<USER>/data/enron-spam-ham/ham100';
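
To make the partitioning idea concrete, here is a small Python sketch that reads a local copy of the two directories and labels every line by the directory it came from, much as the is_spam partition column does in the DDL above. The `read_mail` helper is hypothetical, and the local path in the usage lines is an assumption about where the data was unpacked.

```python
from pathlib import Path
from typing import Iterator, Tuple


def read_mail(root: Path) -> Iterator[Tuple[str, bool]]:
    """Yield (line, is_spam) pairs; the label comes from the directory name."""
    for subdir, is_spam in (("spam100", True), ("ham100", False)):
        for path in sorted((root / subdir).glob("*")):
            if not path.is_file():
                continue
            # errors="replace" tolerates odd bytes in real-world email text.
            with path.open(encoding="utf-8", errors="replace") as f:
                for line in f:
                    yield line.rstrip("\n"), is_spam


if __name__ == "__main__":
    # Assumed local location of the sample; adjust to your checkout.
    spam_lines = sum(1 for _line, is_spam in read_mail(Path("data/enron-spam-ham")) if is_spam)
    print(f"lines labelled as SPAM: {spam_lines}")
```

As with the Hive table, the label is never stored in the files themselves; it is derived entirely from where each file lives.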

Note that you could reformat the files into structured records to do more sophisticated processing of the emails, such as separating out the headers, the "to:", "cc:", and "bcc:" recipients, and the body. For example, the headers could be stored in a Hive MAP and the recipients could be stored in ARRAYs.
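
One way such a reformatting step might look, sketched in Python with the standard library's email package: headers go into a dictionary (the analogue of a Hive MAP) and each recipient field into a list (the analogue of an ARRAY). The sample message, its addresses, and the `recipients` helper are invented for illustration and are not taken from the Enron data.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical message in the name:value header layout; not from the data set.
raw = """\
From: alice@example.com
To: bob@example.com, carol@example.com
Cc: dave@example.com
Subject: Meeting

See you at 3.
"""

msg = message_from_string(raw)

# All headers as a name -> value mapping (the Hive MAP analogue).
# A dict keeps only the last value when a header name repeats.
headers = {name.lower(): value for name, value in msg.items()}


def recipients(field: str) -> list:
    """Return the bare addresses in one recipient header (the ARRAY analogue)."""
    return [addr for _name, addr in getaddresses(msg.get_all(field, []))]


to, cc, bcc = recipients("to"), recipients("cc"), recipients("bcc")
body = msg.get_payload()  # plain string for a non-multipart message

print(headers["subject"], to, cc, bcc)
```

Records shaped like this could then be written out in a delimited form that a Hive table with MAP and ARRAY columns can read.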

Shakespeare's Plays

This is a plain-text version of all of Shakespeare's plays, formatted exactly as you typically see them printed, i.e., using the conventional spacing and layout for plays.

File	Description
`shakespeare/all-shakespeare.txt`	The folio of Shakespeare's plays, as plain text.

To use from Hive as a source of unstructured text:

CREATE EXTERNAL TABLE IF NOT EXISTS shakespeare (line STRING)
LOCATION 'hdfs://server/user/<USER>/data/shakespeare';

Return to the project tutorial.
