slides/2014-02-13-mrjob.html

<!DOCTYPE html>
<html>
  <head>
    <title>Data Mining</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <style type="text/css">
      @import url(http://fonts.googleapis.com/css?family=Droid+Serif);
      @import url(http://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);

      body {
        font-family: 'Droid Serif';
        font-size: 25px;
      }
      .remark-slide-content {
        padding: 1em 2em 1em 2em;
      }
      h1, h2, h3 {
        font-family: 'Yanone Kaffeesatz';
        font-weight: 400;
        margin-top: 0;
        margin-bottom: 0;
      }
      h1 { font-size: 3em; }
      h2 { font-size: 1.8em; }
      h3 { font-size: 1.4em; }
      .footnote {
        position: absolute;
        bottom: 3em;
      }
      ul { margin: 8px;}
      li p { line-height: 1.25em; }
      .red { color: #fa0000; }
      .large { font-size: 2em; }
      a, a > code {
        color: rgb(249, 38, 114);
        text-decoration: none;
      }
      code {
        -moz-border-radius: 3px;
        -web-border-radius: 3px;
        background: #e7e8e2;
        color: black;
        border-radius: 3px;
      }
      .tight-code {
        font-size: 20px;
      }
      .white-background {
        background-color: white;
        padding: 10px;
        display: block;
        margin-left: auto;
        margin-right: auto;
      }
      .limit-size img {
        height: auto;
        width: auto;
        max-width: 1000px;
        max-height: 500px;
       }
      em { color: #80cafa; }
      .pull-left {
        float: left;
        width: 47%;
      }
      .pull-right {
        float: right;
        width: 47%;
      }
      .pull-right ~ p {
        clear: both;
      }
      #slideshow .slide .content code {
        font-size: 1.6em;
      }
      #slideshow .slide .content pre code {
        font-size: 1.6em;
        padding: 15px;
      }
      .inverse {
        background: #272822;
        color: #e3e3e3;
        text-shadow: 0 0 20px #333;
      }
      .inverse h1, .inverse h2 {
        color: #f3f3f3;
        line-height: 1.6em;
      }

      /* Slide-specific styling */
      #slide-inverse .footnote {
        bottom: 12px;
        left: 20px;
      }
      #slide-how .slides {
        font-size: 1.6em;
        position: absolute;
        top:  151px;
        right: 140px;
      }
      #slide-how .slides h3 {
        margin-top: 0.2em;
      }
      #slide-how .slides .first, #slide-how .slides .second {
        padding: 1px 20px;
        height: 90px;
        width: 120px;
        -moz-box-shadow: 0 0 10px #777;
        -webkit-box-shadow: 0 0 10px #777;
        box-shadow: 0 0 10px #777;
      }
      #slide-how .slides .first {
        background: #fff;
        position: absolute;
        top: 20%;
        left: 20%;
        z-index: 1;
      }
      #slide-how .slides .second {
        position: relative;
        background: #fff;
        z-index: 0;
      }

      .center {
        float: center;
      }

      /* Two-column layout */
      .left-column {
        width: 48%;
        float: left;
      }
      .right-column {
        width: 48%;
        float: right;
      }
      .right-column img {
        max-width: 120%;
        max-height: 120%;
      }

      /* Tables */
      table {
        border-collapse: collapse;
        margin: 0px;
      }
      table, th, td {
        border: 1px solid white;
      }
      th, td {
        padding: 7px;
      }

    </style>
  </head>
  <body>
    <textarea id="source">


name: inverse
layout: true
class: left, top, inverse

---

# Lab: mrjob

  + Understand ```review_word_count.py```
  + Find review with most unique words
    + Fill in ```unique_review.py```
  + Find similar users
    + Write ```user_similarity.py```

???

## mrjob

  + Using the Yelp Academic Dataset
  + In lecture, we covered the steps for most unique words
  + Use Jaccard similarity for user_similarity

---

## Data

  + May copy or use in place on ischool machine
  + ischool:

```bash
~jretz/
  yelp_phoenix_academic_dataset/
    yelp_academic_dataset_review.json
```

## Agreement

  + Dataset can only be used for academic purposes
  + You can download it yourself from the [Yelp Dataset Challenge](http://www.yelp.com/dataset_challenge/)

---

## Understand review_word_count.py

  + Note, this will take ~8 minutes on the ischool machine
  + Don't run it just yet!

.tight-code[
```bash
$ python review_word_count.py yelp_academic_dataset_review.json

no configs found; falling back on auto-configuration
creating tmp directory /tmp/review_word_count.jretz.20130215.071901.095847
reading from file
> /home/jretz/src/datamining290/code/venv/bin/python review_word_count.py --step-num=0 --mapper /tmp/review_word_count.jretz.20130215.071901.095847/input_part-00000
writing to /tmp/review_word_count.jretz.20130215.071901.095847/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
...
Streaming final output from /tmp/review_word_count.jretz.20130215.071901.095847/output
"4" 2
"5" 1
"50"  1
"6" 2
"7" 2
"70s" 1
"9" 2
"a" 46
"abbey" 4
"able"  1
"about" 4
```
]

---

## A trick for running quickly while developing...

.tight-code[
```bash
$ head -n 1000 yelp_academic_dataset_review.json | \
    python review_word_count.py
```
]

+ That runs over the first 1,000 lines
+ When things start looking good, try 10,000, then the entire file

---

## Fill in unique_review.py

  + Mutli-step map reduce
  + Steps are explained in lecture
  + Skeleton in code

---

## Write user_similarity.py

  + Find users >= 0.5 similarity
  + User Similarity: Jaccard similarity of businesses reviewed
  + {BizA, BizB, BizC} ~ {BizF, BizB, BizG}

    </textarea>
    <script src="production/remark-0.5.9.min.js" type="text/javascript">
    </script>
    <script type="text/javascript">
      var slideshow = remark.create();
    </script>
  </body>
</html>