Data Algorithms: Recipes for Scaling up with Hadoop and Spark
I am so excited to report that "Data Algorithms" will be going to production this month! This means that a HARD COPY will be available very soon!
O'Reilly author book signings for Data Algorithms will be held at the O'Reilly booth at #SparkSummit2015 on Tuesday, June 16, 2015 at 2:30pm.
I had the honor and privilege of signing my book at #SparkSummit 2015. I want to say a big thank-you to everyone who waited in line to get a signed copy. What a fantastic group of engineers and data scientists! I was amazed and learned a lot from this group. Please note that the signed copy has only the first 5 chapters. The full PDF is available, and the production copy will be ready by July 2015 (I hope!).
Author book signings for "Data Algorithms" will be held at the O'Reilly booth on Thursday, Feb. 19, 2015. Complimentary copies of the book will be provided for the first 25 attendees.
The story of bonus chapters: originally, I had more than 60 chapters for the book (which would have run to over 1,000 pages -- too much!). To keep it short, sweet, and focused, I put 31 chapters in the book and moved the remaining chapters here as bonus chapters. I have started adding them, and the following bonus chapters are already available:
Bonus chapter | Description |
---|---|
Anagram | Anagram detection in Spark and MapReduce/Hadoop |
Cartesian | How to perform the "cartesian" operation in Spark |
Friend Recommendation | Friend recommendation algorithms in Spark and MapReduce/Hadoop |
Log Query | Basic log query implementation in Spark and MapReduce/Hadoop |
Word Count | Hello World! of Spark and MapReduce/Hadoop |
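Word count is the canonical starting example because it exercises the whole MapReduce pattern in a few lines. As a rough local illustration only (not code from the book, and with no cluster involved), the map/shuffle/reduce steps can be sketched in plain Python:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle + reduce: group the pairs by word and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["to be or not to be"]
print(reduce_phase(map_phase(lines)))  # → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same idea becomes a `flatMap` followed by `reduceByKey`; in MapReduce/Hadoop it becomes a Mapper and a Reducer class.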
This repository hosts all source code and scripts for the Data Algorithms book. The book provides a set of distributed MapReduce algorithms, which are implemented using
- Java (JDK7)
- Spark 1.4.0
- MapReduce/Hadoop 2.6.0
Please note that this is a work in progress...
- Title: Data Algorithms
- Author: Mahmoud Parsian
- Publisher: O'Reilly Media
- All source code, libraries, and build scripts are posted here
- Shell scripts are posted for running Spark and MapReduce/Hadoop programs (in progress...)
Software | Version |
---|---|
Java | JDK7 |
Hadoop | 2.6.0 |
Spark | 1.4.0 |
Ant | 1.9.4 |
Name | Description |
---|---|
README.md | The file you are reading now |
README_lib.md | Must read before you build with Ant |
src | Source files for MapReduce/Hadoop/Spark |
scripts | Shell scripts to run MapReduce/Hadoop and Spark programs |
lib | Required jar files for compiling source code |
build.xml | The Ant build script |
dist | The Ant build's output directory (creates a single JAR file) |
LICENSE | License for using this repository (Apache License, Version 2.0) |
misc | Miscellaneous files for this repository |
setenv | Example of how to set your environment variables before building |
data | Sample data files (such as FASTQ and FASTA) for basic testing purposes |
Also, each chapter has two subfolders:
org.dataalgorithms.chapNN.spark (for Spark programs)
org.dataalgorithms.chapNN.mapreduce (for MapReduce/Hadoop programs)
where NN = 00, 01, ..., 31
- How To Run MapReduce/Hadoop Programs
- How To Run Java/Spark Programs in YARN
- How To Run Java/Spark Programs in Spark Cluster
To run Python programs, just call them with spark-submit together with the program's arguments.
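For example (the script name and the input/output paths below are hypothetical placeholders, not files from this repository):

```shell
# submit a Python program to Spark; word_count.py and both paths
# are placeholders for your own script and data
spark-submit word_count.py /input/sample.txt /output/counts
```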
- View Mahmoud Parsian's profile on LinkedIn
- Please send me an email: mahmoud.parsian@yahoo.com
- Twitter: @mahmoudparsian
Thank you!
Best regards,
Mahmoud Parsian