Matt Bossenbroek edited this page Mar 5, 2014 · 7 revisions

PigPen is map-reduce for Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

What is PigPen?

  • A map-reduce language that looks and behaves like clojure.core
  • The ability to write map-reduce queries as programs, not scripts
  • Strong support for unit tests and iterative development

Really, yet another map-reduce language?

If you know Clojure, you already know PigPen

The primary goal of PigPen is to take language out of the equation. PigPen operators are designed to be as close as possible to the Clojure equivalents. There are no special user defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program.

Here's the proverbial word count:

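(require '[clojure.string]          ; used fully qualified below
         '[pigpen.core :as pig])

(defn word-count [lines]
  (->> lines
    ;; each row is a vector of fields; the first field holds the line text
    (pig/mapcat #(-> % first
                   (clojure.string/lower-case)
                   (clojure.string/replace #"[^\w\s]" "")  ; strip punctuation
                   (clojure.string/split #"\s+")))         ; split on whitespace
    ;; yields [word occurrences] pairs
    (pig/group-by identity)
    (pig/map (fn [[word occurrences]] [word (count occurrences)]))))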

This defines a function that returns a PigPen query expression. The query takes a sequence of lines and returns how often each word appears. As you can see, this is just the word count logic. External concerns, like where our data is coming from or going to, stay out of it.
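Because the operators mirror clojure.core, the same pipeline can be sketched on an in-memory collection with no PigPen involved. This is only an illustration: word-count-local is a hypothetical name, and the sketch swaps pig/mapcat, pig/group-by, and pig/map for their clojure.core counterparts.

```clojure
(require '[clojure.string])

;; Hypothetical in-memory twin of word-count, using clojure.core only.
(defn word-count-local [lines]
  (->> lines
       (mapcat #(-> % first
                    (clojure.string/lower-case)
                    (clojure.string/replace #"[^\w\s]" "")
                    (clojure.string/split #"\s+")))
       ;; clojure.core/group-by returns a map; seq-ing it gives [word occurrences] entries
       (group-by identity)
       (map (fn [[word occurrences]] [word (count occurrences)]))))

(set (word-count-local [["The fox jumped over the dog."]
                        ["The cow jumped over the moon."]]))
```

Apart from the missing pig/ prefixes, the body is the same as the PigPen version.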

Will it compose?

Yep - PigPen queries are written as function compositions - data in, data out. Write it once and avoid the copy & paste routine.

Here we use our word-count function (defined above), along with a load and store command, to make a PigPen query:

(defn word-count-query [input output]
  (->>
    (pig/load-tsv input)
    (word-count)
    (pig/store-tsv output)))

This function returns the PigPen representation of the query. By itself, it won't do anything - we have to execute it locally or generate a script (more on that later).
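Since every stage is a function from query to query, adding a step is ordinary function composition. The sketch below assumes PigPen is on the classpath; at-least is a hypothetical helper (not part of PigPen) that keeps only [word n] pairs whose count reaches a threshold, and pig/return stands in for the output of an earlier stage.

```clojure
(require '[pigpen.core :as pig])

;; Hypothetical stage: keep [word n] pairs where n >= min-count.
(defn at-least [min-count counts]
  (pig/filter (fn [[_ n]] (>= n min-count)) counts))

;; Compose it like any other function; pig/dump runs the query locally.
(def frequent
  (->> (pig/return [["the" 4] ["fox" 1] ["over" 2]])
       (at-least 2)
       (pig/dump)
       (set)))
```

The same at-least stage could be threaded after word-count and before pig/store-tsv without changing either of them.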

You like unit tests? Yeah, we do that

With PigPen, you can mock input data and write a unit test for your query. No more crossing your fingers & wondering what will happen when you submit to the cluster. No more separate files for test input & output.

Mocking data is really easy. With pig/return and pig/constantly, you can inject arbitrary data as a starting point for your script.

A common pattern is to use pig/take to sample a few rows of the actual source data. Wrap the result with pig/return and you've got mock data.

(require '[clojure.test :refer [deftest is]])

(deftest test-word-count
  (let [data (pig/return [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]])]
    (is (= (set (pig/dump (word-count data)))
           #{["moon" 1] ["jumped" 2] ["dog" 1] ["over" 2] ["cow" 1] ["fox" 1] ["the" 4]}))))

The pig/dump operator runs the query locally.
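The sampling pattern mentioned above might look like the following sketch. It assumes PigPen is on the classpath and that pig/take takes the row count first and the relation second. The names source, sample-rows, and mock-data are hypothetical, and pig/return stands in for a real loader such as pig/load-tsv so the example does not need a file.

```clojure
(require '[pigpen.core :as pig])

;; Stand-in for real source data; in practice this would be a loader
;; such as (pig/load-tsv "input.tsv").
(def source
  (pig/return [["The fox jumped over the dog."]
               ["The cow jumped over the moon."]
               ["The dish ran away with the spoon."]]))

;; Run locally and keep only two rows of the source.
(def sample-rows
  (vec (pig/dump (pig/take 2 source))))

;; Reuse the sampled rows as mock input for tests.
(def mock-data
  (pig/return sample-rows))
```

Which two rows come back is not guaranteed, so a test built this way should paste the sampled rows in as literal data rather than re-sample on every run.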

Closures (yes, the kind with an S)

Parameterizing your query is trivial. Any functions, function parameters, or let bindings that are in scope where you define a function can be used inside it.

(defn inc-two [x]
  (+ x 2))

(defn reusable-fn [lower-bound data]
  (let [upper-bound (+ lower-bound 10)]
    (->> data
      (pig/filter (fn [x] (< lower-bound x upper-bound)))
      (pig/map inc-two))))

Note that inc-two, lower-bound, and upper-bound are present when we generate the script and are made available when the function is executed within the cluster.
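To see captured values at work without a cluster, a closure-based query can be run locally. The sketch below assumes PigPen is on the classpath; within is a hypothetical function in the same style as reusable-fn, closing over one parameter and one let binding.

```clojure
(require '[pigpen.core :as pig])

;; Hypothetical example: lower-bound (a parameter) and upper-bound
;; (a let binding) are both used inside the anonymous filter function.
(defn within [lower-bound width data]
  (let [upper-bound (+ lower-bound width)]
    (pig/filter (fn [x] (< lower-bound x upper-bound)) data)))

;; Keeps values strictly between 2 and 12.
(def kept
  (set (pig/dump (within 2 10 (pig/return [1 3 11 12 20])))))
```

Calling within with different arguments yields a different query each time, with no string templating involved.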

Note: To exclude a local variable, add the metadata ^:local to the declaration.

Read more about closures here

So how do I use it?

Just tell PigPen where to write the query as a Pig script:

(pig/write-script "word-count.pig" (word-count-query "input.tsv" "output.tsv"))

And now you have a Pig script which you can submit to your cluster. The script uses pigpen.jar, an uberjar with all of the required dependencies along with your code. The easiest way to create this jar is to build an uberjar for your project. Check out the tutorial for how to build an uberjar and run the script in Pig.

As you saw before, we can also use pig/dump to run the query locally and return Clojure data:

=> (def data (pig/return [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]]))
#'pigpen-wiki/data

=> (set (pig/dump (word-count data)))
#{["moon" 1] ["jumped" 2] ["dog" 1] ["over" 2] ["cow" 1] ["fox" 1] ["the" 4]}

Note: set is used because the ordering of the resulting data is not deterministic.
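One caveat with set is that it also collapses duplicate rows. When duplicates matter, comparing with clojure.core's frequencies ignores order but keeps multiplicity. This is a plain Clojure sketch, not a PigPen feature, and same-rows? is a hypothetical helper name.

```clojure
;; Order-insensitive comparison that still counts duplicate rows.
(defn same-rows? [expected actual]
  (= (frequencies expected) (frequencies actual)))

;; Same rows in a different order: equal.
(same-rows? [["a" 1] ["b" 2] ["a" 1]]
            [["b" 2] ["a" 1] ["a" 1]])   ; => true

;; A missing duplicate is caught here, whereas set would hide it.
(same-rows? [["a" 1] ["a" 1]] [["a" 1]]) ; => false
(= (set [["a" 1] ["a" 1]]) (set [["a" 1]])) ; => true
```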

What's next?

Getting started with Clojure and PigPen is really easy.

  • The wiki explains what PigPen does and why we made it
  • The tutorial is the best way to get Clojure and PigPen installed and start writing queries
  • The full API lists all of the operators with example usage
  • PigPen for Clojure users is great for Clojure users new to map-reduce
  • PigPen for Pig users is great for Pig users new to Clojure