Data Algorithms: Recipes for Scaling up with Hadoop and Spark
I am so excited to report that "Data Algorithms" will be going to production this month! This means that a HARD COPY will be available very soon!
O'Reilly author book signings for Data Algorithms will be held at the O'Reilly booth at #SparkSummit2015 on Tuesday, June 16, 2015 at 2:30pm.
I had the honor and privilege of signing my book at #SparkSummit 2015. I want to say a big thank-you to everyone who waited in line to get a signed copy. What a fantastic group of engineers and data scientists! I was amazed and learned a lot from this group. Please note that the signed copy has only the first 5 chapters. The full PDF is available, and the production copy will be ready by July 2015 (I hope!).
Author book signings for "Data Algorithms" will be held at the O'Reilly booth on Thursday, Feb. 19, 2015. Complimentary copies of the book will be provided for the first 25 attendees.
The story of bonus chapters: originally, I had more than 60 chapters for the book (which would have run to over 1,000 pages -- too much!). To keep it short, sweet, and focused, I put 31 chapters in the book and moved the remaining chapters here as bonus chapters. I have started adding them, and the following bonus chapters are already available:
Bonus chapter | Description |
---|---|
Anagram | Anagram detection in Spark and MapReduce/Hadoop |
Cartesian | How to perform the "cartesian" operation in Spark |
Friend Recommendation | Friend recommendation algorithms in Spark and MapReduce/Hadoop |
Log Query | Basic log query implementation in Spark and MapReduce/Hadoop |
Word Count | Hello World! of Spark and MapReduce/Hadoop |
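Word count is the canonical starting example because it exercises the whole MapReduce pattern in a few lines. As a rough local illustration only (not code from the book, and with no cluster involved), the map/shuffle/reduce steps can be sketched in plain Python:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle + reduce: group the pairs by word and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["to be or not to be"]
print(reduce_phase(map_phase(lines)))  # → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same idea becomes a `flatMap` followed by `reduceByKey`; in MapReduce/Hadoop it becomes a Mapper and a Reducer class.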
This repository hosts all source code and scripts for the Data Algorithms book. The book provides a set of distributed MapReduce algorithms, which are implemented using
- Java (JDK7)
- Spark 1.4.0
- MapReduce/Hadoop 2.6.0
Please note that this is a work in progress...
- Title: Data Algorithms
- Author: Mahmoud Parsian
- Publisher: O'Reilly Media
- All source code, libraries, and build scripts are posted here
- Shell scripts are posted for running Spark and MapReduce/Hadoop programs (in progress...)
Software | Version |
---|---|
Java | JDK7 |
Hadoop | 2.6.0 |
Spark | 1.4.0 |
Ant | 1.9.4 |
Name | Description |
---|---|
README.md | The file you are reading now |
README_lib.md | Must read before you build with Ant |
src | Source files for MapReduce/Hadoop/Spark |
scripts | Shell scripts to run MapReduce/Hadoop and Spark programs |
lib | Required jar files for compiling source code |
build.xml | The Ant build script |
dist | The Ant build's output directory (creates a single JAR file) |
LICENSE | License for using this repository (Apache License, Version 2.0) |
misc | Miscellaneous files for this repository |
setenv | Example of how to set your environment variables before building |
data | Sample data files (such as FASTQ and FASTA) for basic testing purposes |
Also, each chapter has two subfolders:
org.dataalgorithms.chapNN.spark (for Spark programs)
org.dataalgorithms.chapNN.mapreduce (for MapReduce/Hadoop programs)
where NN = 00, 01, ..., 31
- How To Run MapReduce/Hadoop Programs
- How To Run Java/Spark Programs in YARN
- How To Run Java/Spark Programs in Spark Cluster
To run Python programs, just call them with spark-submit together with the program's arguments.
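For example (the script name and the input/output paths below are hypothetical placeholders, not files from this repository):

```shell
# submit a Python program to Spark; word_count.py and both paths
# are placeholders for your own script and data
spark-submit word_count.py /input/sample.txt /output/counts
```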
- View Mahmoud Parsian's profile on LinkedIn
- Please send me an email: mahmoud.parsian@yahoo.com
- Twitter: @mahmoudparsian
Thank you!
Best regards,
Mahmoud Parsian