README for the data directory

When you download this example from GitHub, you'll need to copy the data to HDFS. Use the following command. Note the last argument; the examples in this tutorial assume the data is in the location shown. You can change it, but then you'll have to pass arguments to the programs to specify the new location:

hadoop fs -put data /user/$USER/data

This step has already been done for prepackaged Hadoop distribution virtual machines.

Hence, the following discussion applies to both data locations.

Sacred Texts

The following ancient, sacred texts are from www.sacred-texts.com/bib/osrc/. All are copyright-free texts, where each verse is on a separate line, prefixed by the (abbreviated) book name, chapter number, and verse number, all "|" separated.

File Description
kjvdat.txt The King James Version of the Bible. For some reason, each line (one per verse) ends with a "~" character.
t3utf.dat Tanach, the Hebrew Bible.
vuldat.txt The Latin Vulgate.
sept.txt The Septuagint (Koine Greek of the Hebrew Old Testament).
ugntdat.txt The Greek New Testament.
apodat.txt The Apocrypha (in English).
abbrevs-to-names.tsv A map from the book abbreviations used in these texts to the full book names. Derived using data from the sacred-texts.com site.

There are many other texts from the world's religious traditions at the www.sacred-texts.com site, but most of the others aren't formatted into one convenient file like these examples.
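As a minimal sketch of the verse layout described above, the following Python function splits one line into its four fields and drops the trailing "~" that kjvdat.txt uses. The names parse_verse and Verse and the sample line are illustrative assumptions, not part of the tutorial code.

```python
from typing import NamedTuple, Optional


class Verse(NamedTuple):
    book: str      # book abbreviation, e.g. "Gen"
    chapter: int
    verse: int
    text: str


def parse_verse(line: str) -> Optional[Verse]:
    """Parse one 'book|chapter|verse|text' line; return None if malformed."""
    # Split on the first three separators only, so a "|" inside the text survives.
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        return None
    book, chapter, verse, text = parts
    if not (chapter.isdecimal() and verse.isdecimal()):
        return None
    # kjvdat.txt ends each verse with "~"; strip it if present.
    text = text.strip()
    if text.endswith("~"):
        text = text[:-1].rstrip()
    return Verse(book, int(chapter), int(verse), text)


if __name__ == "__main__":
    # Illustrative sample line in the documented format.
    sample = "Gen|1|1|In the beginning God created the heaven and the earth.~"
    print(parse_verse(sample))
```

Returning None for malformed lines, rather than raising, lets a caller count or skip bad records, which tends to be convenient when scanning whole files.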

Here are Hive DDL statements for these files, if you want to put them into Hive.

For example, this DDL statement can be used for the data/kjvdat.txt file. It assumes you've copied the file to the directory hdfs://server/user/<USER>/data/kjvdat in HDFS, where server is the server name or IP address for the NameNode. Note that the table location must be a directory path rather than a file name.

CREATE EXTERNAL TABLE IF NOT EXISTS kjv (
  book    STRING,
  chapter INT,
  verse   INT,
  text    STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 'hdfs://server/user/<USER>/data/kjvdat';

Actually, the hdfs://server prefix can be omitted; Hive will infer the correct file system type based on its configuration. Replace <USER> with the actual user name.

The same DDL can be used for the other files mentioned above, except for the name map file, abbrevs-to-names.tsv. Here is a DDL statement for the latter, assuming the file is copied to a directory hdfs://server/user/<USER>/data/abbrevs_to_names in HDFS.

CREATE EXTERNAL TABLE IF NOT EXISTS abbrevs_to_names (
  abbrev  STRING,
  book    STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs://server/user/<USER>/data/abbrevs_to_names';

Note that the field delimiter is tab, not "|".
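Outside Hive, the same tab-separated layout can be read with a few lines of Python. This is a sketch under assumptions: the function name load_abbrevs and the two sample rows are hypothetical, and the real file may contain rows this code skips.

```python
import csv
import io
from typing import Dict, TextIO


def load_abbrevs(source: TextIO) -> Dict[str, str]:
    """Read tab-separated 'abbrev<TAB>book' rows into a dict."""
    names: Dict[str, str] = {}
    # QUOTE_NONE treats quote characters as ordinary text.
    reader = csv.reader(source, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        names[row[0].strip()] = row[1].strip()
    return names


if __name__ == "__main__":
    # Hypothetical sample rows; the actual file contents may differ.
    sample = io.StringIO("Gen\tGenesis\nExo\tExodus\n")
    names = load_abbrevs(sample)
    # Fall back to the abbreviation itself when no full name is known.
    print(names.get("Gen", "Gen"))
```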

Julius Caesar's "Gallic Wars"

gallic-mb-txt is an English translation of Julius Caesar's famous memoir, Gallic Wars, about his conquest of Gaul (roughly modern France, the French part of Switzerland, and parts of Germany).

Email Classified as SPAM and HAM

A sample of SPAM/HAM classified emails from the well-known Enron email data set was adapted from this research project. Each file is plain text, partially formatted (i.e., with name:value headers) as used in email servers and clients.

Directory Description
enron-spam-ham/ham100 A sample of 100 emails from the dataset that were classified as HAM.
enron-spam-ham/spam100 A sample of 100 emails from the dataset that were classified as SPAM.

If you want to load the emails as raw text, one line per "record", use this Hive DDL, which defines two partitions, one for HAM and one for SPAM. We assume you have copied the data/enron-spam-ham directory to HDFS at hdfs://server/user/<USER>/data/enron-spam-ham:

CREATE EXTERNAL TABLE IF NOT EXISTS mail (line STRING)
PARTITIONED BY (is_spam BOOLEAN);

ALTER TABLE mail ADD PARTITION(is_spam = true)
LOCATION 'hdfs://server/user/<USER>/data/enron-spam-ham/spam100';

ALTER TABLE mail ADD PARTITION(is_spam = false)
LOCATION 'hdfs://server/user/<USER>/data/enron-spam-ham/ham100';

Note that you could reformat the files into structured records to do more sophisticated processing of emails, such as separating out the headers, the "to:", "cc:", and "bcc:" recipients, and the body. For example, the headers could be stored in a Hive MAP and the recipients could be stored in ARRAYs.
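One way to sketch that restructuring is with Python's standard email package, producing a dictionary of headers (analogous to a Hive MAP) and lists of recipient addresses (analogous to ARRAYs). The names MailRecord and parse_mail and the sample message are assumptions for illustration; the actual files may lack some of these headers, and multipart messages are not handled here.

```python
from email import message_from_string
from email.utils import getaddresses
from typing import Dict, List, NamedTuple


class MailRecord(NamedTuple):
    headers: Dict[str, str]  # analogous to a Hive MAP<STRING,STRING>
    to: List[str]            # analogous to a Hive ARRAY<STRING>
    cc: List[str]
    bcc: List[str]
    body: str


def _addresses(msg, name: str) -> List[str]:
    # get_all returns every header with this name, or [] when none exist.
    return [addr for _, addr in getaddresses(msg.get_all(name, [])) if addr]


def parse_mail(raw: str) -> MailRecord:
    """Split a raw 'name: value' style email into headers, recipients, body."""
    msg = message_from_string(raw)
    # Lower-case header names; if a header repeats, the last value wins.
    headers = {name.lower(): value for name, value in msg.items()}
    # For multipart messages the payload is a list; this sketch drops it.
    body = "" if msg.is_multipart() else msg.get_payload()
    return MailRecord(
        headers,
        _addresses(msg, "to"),
        _addresses(msg, "cc"),
        _addresses(msg, "bcc"),
        body,
    )


if __name__ == "__main__":
    # Hypothetical message, not taken from the data set.
    raw = (
        "Subject: meeting\n"
        "To: a@example.com, Bob <b@example.com>\n"
        "Cc: c@example.com\n"
        "\n"
        "See you at noon.\n"
    )
    print(parse_mail(raw))
```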

Shakespeare's Plays

This is the plain-text version of all of Shakespeare's plays, formatted exactly as you typically see them printed, i.e., using the conventional spacing and layout for plays.

Directory Description
shakespeare/all-shakespeare.txt The folio of Shakespeare's plays, as plain text.

To use from Hive as a source of unstructured text:

CREATE EXTERNAL TABLE IF NOT EXISTS shakespeare (line STRING)
LOCATION 'hdfs://server/user/<USER>/data/shakespeare';

Return to the project tutorial.