-
Example:
-
Combiner: once the mapper has finished producing key-value pairs, a combiner does some reduction work on the mapper node itself, such as pre-aggregating data before it is sent to the reducer, to save network bandwidth.
- ex: ./word_frequency_with_combiner.py (sketched below)
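- A minimal sketch of what a combiner job like ./word_frequency_with_combiner.py could contain, using mrjob's combiner() hook (names here are illustrative, not necessarily the course file's exact contents):

    from mrjob.job import MRJob

    class MRWordFrequencyWithCombiner(MRJob):

        def mapper(self, _, line):
            # Emit (word, 1) for every word in the input line.
            for word in line.split():
                yield word.lower(), 1

        def combiner(self, word, counts):
            # Runs on the mapper node: pre-aggregates local counts so
            # fewer key-value pairs cross the network to the reducers.
            yield word, sum(counts)

        def reducer(self, word, counts):
            # Final aggregation across all mappers.
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordFrequencyWithCombiner.run()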
-
Attach a config/data file to each MapReduce job so it is shipped to every node in the cluster: ./most_popular_movie_with_name_lookup.py (sketched below)
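- A hedged sketch of how ./most_popular_movie_with_name_lookup.py might attach a lookup file: mrjob's add_file_arg() ships the named file to every node, and reducer_init() loads it from the task's working directory (mrjob 0.6+ API; assumes the MovieLens u.data/u.item formats):

    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MRMostPopularMovie(MRJob):

        def configure_args(self):
            super(MRMostPopularMovie, self).configure_args()
            # Ship the named file to every node alongside the job.
            self.add_file_arg('--items', help='Path to u.item movie-name lookup')

        def steps(self):
            return [
                MRStep(mapper=self.mapper_get_ratings,
                       reducer=self.reducer_count_ratings),
                MRStep(reducer_init=self.reducer_init_names,
                       reducer=self.reducer_find_max),
            ]

        def mapper_get_ratings(self, _, line):
            # u.data format: user_id \t movie_id \t rating \t timestamp
            user_id, movie_id, rating, timestamp = line.split('\t')
            yield movie_id, 1

        def reducer_count_ratings(self, movie_id, counts):
            yield None, (sum(counts), movie_id)

        def reducer_init_names(self):
            # The attached file appears in the task's working directory.
            self.movie_names = {}
            with open(self.options.items, encoding='ISO-8859-1') as f:
                for line in f:
                    fields = line.split('|')
                    self.movie_names[fields[0]] = fields[1]

        def reducer_find_max(self, _, count_movie_pairs):
            count, movie_id = max(count_movie_pairs)
            yield self.movie_names[movie_id], count

    if __name__ == '__main__':
        MRMostPopularMovie.run()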
-
HDFS (Hadoop Distributed File System): the file system Hadoop uses to distribute the data it accesses across the cluster; YARN manages how Hadoop jobs are distributed across that cluster.
-
Apache YARN: what Hadoop uses to figure out which mapper/reducer to run where, how to connect them all together, keep track of what's running, etc.
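- Usage sketch: the same mrjob script can run locally or be submitted to a Hadoop cluster, where YARN schedules the tasks and HDFS holds the input (file paths below are placeholders):

    # run locally with mrjob's default inline runner
    python word_frequency_with_combiner.py book.txt

    # submit to a Hadoop cluster: YARN schedules the mappers/reducers,
    # and the input is read from HDFS
    python word_frequency_with_combiner.py -r hadoop hdfs:///user/me/book.txt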
- Python tool for big data: Enthought Canopy
- mrjob package: for writing MapReduce jobs in Python; install from the Canopy editor -> !pip install mrjob
- Sample data: http://grouplens.org/
- datasets -> MovieLens 100K Dataset (ml-100k.zip)
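- Once ml-100k.zip is unpacked, the jobs above can be pointed at its files, e.g. (the ml-100k/ paths and the --items flag come from the sketch above, not the course file itself):

    python most_popular_movie_with_name_lookup.py --items=ml-100k/u.item ml-100k/u.data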