prince6635/movie-ratings-by-mapreduce-and-hadoop
Big data (movie ratings) based on Hadoop and MapReduce

MapReduce

  • Example: how many movies has each user watched?

    • Mapper output: key = user_id, value = movie_id. Duplicate keys are fine at this stage, since the reducer will group and aggregate them later.

    • (Diagrams in the repo illustrate the flow: sample data, Map I–III, Reduce, and the combined view.)

    • Code snippet: how many movies are there for each rating?
      • Input fields: user_id, movie_id, rating, timestamp
  • Combiner: once the mapper has finished producing key-value pairs, do some of the reduction work on the mapper node, e.g. aggregating values locally before sending them to the reducer, to save network bandwidth.

    • ex: ./word_frequency_with_combiner.py
  • Attach a config/data file to each MapReduce job across distributed nodes: ./most_popular_movie_with_name_lookup.py
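One way to attach a lookup file is mrjob's `add_file_arg`, which copies the named file into each task's working directory across the cluster. This sketch is an assumption about how such a job could look (class name and the `--items` flag are hypothetical, not taken from the repo's script); it uses the MovieLens u.item format (pipe-separated movie_id|title|...):

```python
from mrjob.job import MRJob

class MostPopularMovieWithNames(MRJob):
    """Hypothetical sketch: ship a u.item lookup file to every node."""

    def configure_args(self):
        super(MostPopularMovieWithNames, self).configure_args()
        # add_file_arg tells mrjob to distribute this file alongside the job
        self.add_file_arg('--items', help='u.item file: movie_id|title|...')

    def mapper(self, _, line):
        user_id, movie_id, rating, timestamp = line.split('\t')
        yield movie_id, 1

    def reducer_init(self):
        # Load the distributed lookup file once per reducer task
        self.movie_names = {}
        with open(self.options.items, encoding='ISO-8859-1') as f:
            for row in f:
                fields = row.split('|')
                self.movie_names[fields[0]] = fields[1]

    def reducer(self, movie_id, counts):
        # Emit the human-readable title instead of the raw movie_id
        yield self.movie_names.get(movie_id, movie_id), sum(counts)

if __name__ == '__main__':
    MostPopularMovieWithNames.run()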

  • How MapReduce scales across a cluster (distributed computing): see the Distributed diagram in the repo.

Hadoop (Run MapReduce job in a distributed way)


  • HDFS (Hadoop Distributed File System): used by Hadoop to distribute and store the data its jobs access, while YARN manages how those jobs are distributed across the cluster. (HDFS diagram)

  • Apache YARN: what Hadoop uses to figure out which mapper/reducer to run where, how to connect them all together, keep track of what's running, etc. (YARN diagram)

  • AWS Elastic MapReduce (EMR): managed Hadoop clusters on AWS. (EMR diagram)
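A convenient property of mrjob is that the same script runs unchanged on all three backends via the standard `-r` runner flag. The script name, bucket, and HDFS paths below are placeholders, not values from this repo:

```shell
# Run locally (inline runner) against the MovieLens ratings file:
python my_mrjob_script.py u.data

# Same script, unchanged, on a Hadoop cluster reading from HDFS:
python my_mrjob_script.py -r hadoop hdfs:///user/me/u.data

# Or on AWS Elastic MapReduce (requires AWS credentials in mrjob.conf):
python my_mrjob_script.py -r emr s3://my-bucket/u.data
```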

Tools

  • Python tool for big data: Enthought Canopy
    • mrjob package: for MapReduce. In the Canopy editor: !pip install mrjob
  • Sample data: http://grouplens.org/
    • datasets -> MovieLens 100K Dataset (ml-100k.zip)
