Home
PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig or Cascading, but you don't need to know much about either of them to use it.
# What is PigPen?
- A map-reduce language that looks and behaves like clojure.core
- The ability to write map-reduce queries as programs, not scripts
- Strong support for unit tests and iterative development
The primary goal of PigPen is to take language out of the equation. PigPen operators are designed to be as close as possible to the Clojure equivalents. There are no special user defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program.
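As a rough illustration of how closely the operators track `clojure.core`, the sketch below runs the same pipeline twice: once over an in-memory vector, once as a PigPen query. The input data is made up for the example; `pig/return` and `pig/dump` are the mocking and local-execution operators described later on this page.

```clojure
(require '[pigpen.core :as pig])

;; Plain Clojure over an in-memory collection.
(def local-result
  (->> [1 2 3 4 5]
       (map inc)
       (filter even?)))

;; The same pipeline as a PigPen query. pig/return wraps literal data
;; as a relation; pig/dump runs the query locally.
(def pigpen-result
  (->> (pig/return [1 2 3 4 5])
       (pig/map inc)
       (pig/filter even?)
       (pig/dump)))

;; Compare as sets, since PigPen does not guarantee row order.
(= (set local-result) (set pigpen-result))
```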
Here's the proverbial word count:
(require '[pigpen.core :as pig])

(defn word-count [lines]
  (->> lines
       (pig/mapcat #(-> % first
                        (clojure.string/lower-case)
                        (clojure.string/replace #"[^\w\s]" "")
                        (clojure.string/split #"\s+")))
       (pig/group-by identity)
       (pig/map (fn [[word occurrences]] [word (count occurrences)]))))
This defines a function that returns a PigPen query expression. The query takes a sequence of lines and returns how often each word appears. As you can see, this is just the word count logic. We don't have to conflate external concerns, like where our data is coming from or going to.
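For comparison, here is a sketch of the same logic written against `clojure.core` only, with no PigPen involved. The name `word-count-local` is made up for this example. Mapping over the map returned by `group-by` yields `[key values]` entries, so the same destructuring applies.

```clojure
(require '[clojure.string :as str])

;; Hypothetical in-memory counterpart of word-count, using clojure.core only.
(defn word-count-local [lines]
  (->> lines
       (mapcat #(-> % first
                    (str/lower-case)
                    (str/replace #"[^\w\s]" "")
                    (str/split #"\s+")))
       (group-by identity)
       (map (fn [[word occurrences]] [word (count occurrences)]))))

(def local-counts
  (set (word-count-local [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]])))
;; local-counts contains ["the" 4], ["jumped" 2], ["over" 2], and the
;; remaining words with a count of 1.
```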
PigPen queries are written as function compositions: data in, data out. Write it once and avoid the copy & paste routine.
Here we use our word-count function (defined above), along with a load and store command, to make a PigPen query:
(defn word-count-query [input output]
  (->>
    (pig/load-tsv input)
    (word-count)
    (pig/store-tsv output)))
This function returns the PigPen representation of the query. By itself, it won't do anything - we have to execute it locally or generate a script (more on that later).
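Because a query is an ordinary value until it is executed, more operators can be threaded onto it before anything runs. The sketch below repeats `word-count` so it stands alone and adds a hypothetical `repeated-words` step. It uses `pig/return` and `pig/dump` (covered in the sections that follow) in place of a real load and store.

```clojure
(require '[pigpen.core :as pig])

;; word-count as defined on this page, repeated so the sketch stands alone.
(defn word-count [lines]
  (->> lines
       (pig/mapcat #(-> % first
                        (clojure.string/lower-case)
                        (clojure.string/replace #"[^\w\s]" "")
                        (clojure.string/split #"\s+")))
       (pig/group-by identity)
       (pig/map (fn [[word occurrences]] [word (count occurrences)]))))

;; Hypothetical extra step: keep only words that occur more than once.
(defn repeated-words [counts]
  (->> counts
       (pig/filter (fn [[_ n]] (> n 1)))))

;; Nothing runs until pig/dump is called on the composed query.
(def repeated
  (->> (pig/return [["The fox jumped over the dog."]
                    ["The cow jumped over the moon."]])
       (word-count)
       (repeated-words)
       (pig/dump)
       (set)))
```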
With PigPen, you can mock input data and write a unit test for your query. No more crossing your fingers & wondering what will happen when you submit to the cluster. No more separate files for test input & output.
Mocking data is really easy. With `pig/return` and `pig/constantly`, you can inject arbitrary data as a starting point for your script.
A common pattern is to use `pig/take` to sample a few rows of the actual source data. Wrap the result with `pig/return` and you've got mock data.
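Both patterns might look like the sketch below. The names `load-lines`, `long-lines`, `source`, and the sample rows are made up for illustration. In real use, `source` would be a load command such as `(pig/load-tsv "input.tsv")` rather than `pig/return`.

```clojure
(require '[pigpen.core :as pig])

;; Sampling pattern: take a few rows from a source and freeze them as mock data.
;; `source` stands in for a real load command here.
(def source (pig/return [["a"] ["b"] ["c"]]))
(def sample-rows (pig/dump (pig/take 2 source)))
(def mock-data (pig/return sample-rows))

;; Stubbing pattern: replace a loader function with constant data in a test.
;; Hypothetical loader that a production query would call.
(defn load-lines [path]
  (pig/load-tsv path))

;; Hypothetical query built on top of the loader.
(defn long-lines [path]
  (->> (load-lines path)
       (pig/filter (fn [[line]] (> (count line) 10)))))

(def stubbed-result
  (with-redefs [load-lines (pig/constantly [["short"] ["a much longer line"]])]
    (pig/dump (long-lines "ignored-path.tsv"))))
```

Neither `pig/return` nor `pig/constantly` reads from disk, so the path passed to `long-lines` is never opened.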
(use 'clojure.test)

(deftest test-word-count
  (let [data (pig/return [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]])]
    (is (= (set (pig/dump (word-count data)))
           #{["moon" 1] ["jumped" 2] ["dog" 1] ["over" 2] ["cow" 1] ["fox" 1] ["the" 4]}))))
The `pig/dump` operator runs the query locally.
Parameterizing your query is trivial. Any functions, in-scope function parameters, or let bindings can be used inside the functions you pass to PigPen operators.
(defn inc-two [x]
  (+ x 2))

(defn reusable-fn [lower-bound data]
  (let [upper-bound (+ lower-bound 10)]
    (->> data
         (pig/filter (fn [x] (< lower-bound x upper-bound)))
         (pig/map inc-two))))
Note that `inc-two`, `lower-bound`, and `upper-bound` are present when we generate the script, and are made available when the function is executed within the cluster.
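A quick local check of the parameterized query could look like this. The definitions are repeated so the sketch stands alone, and the input values and the lower bound of 2 are arbitrary choices for the example.

```clojure
(require '[pigpen.core :as pig])

(defn inc-two [x]
  (+ x 2))

(defn reusable-fn [lower-bound data]
  (let [upper-bound (+ lower-bound 10)]
    (->> data
         (pig/filter (fn [x] (< lower-bound x upper-bound)))
         (pig/map inc-two))))

;; With lower-bound 2 the filter keeps values strictly between 2 and 12,
;; so only 5 and 9 survive and are then incremented by inc-two.
(def bounded
  (set (pig/dump (reusable-fn 2 (pig/return [1 5 9 13 20])))))
```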
Note: To exclude a local variable, add the metadata ^:local to the declaration.
Read more about closures here
Just tell PigPen where to write the query as a Pig script:
(require '[pigpen.pig])
(pigpen.pig/write-script "word-count.pig" (word-count-query "input.tsv" "output.tsv"))
And now you have a Pig script which you can submit to your cluster. The script uses `pigpen.jar`, an uberjar with all of the required dependencies along with your code. The easiest way to create this jar is to build an uberjar for your project. Check out the tutorial for how to build an uberjar and run the script in Pig.
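To inspect the script without writing a file, the sketch below assumes that the `pigpen.pig` namespace also provides `generate-script`, which returns the script as a string. Check that assumption against the full API for your PigPen version. The query and the file names are placeholders.

```clojure
(require '[pigpen.core :as pig]
         '[pigpen.pig])

;; A small placeholder query: load rows, keep the first field, store the result.
(def query
  (->> (pig/load-tsv "input.tsv")
       (pig/map first)
       (pig/store-tsv "output.tsv")))

;; Assumed API: generate-script returns the Pig script text instead of
;; writing it to disk, which is handy for a quick look at the output.
(def script (pigpen.pig/generate-script query))

(println script)
```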
As you saw before, we can also use `pig/dump` to run the query locally and return Clojure data:
=> (def data (pig/return [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]]))
#'pigpen-wiki/data

=> (set (pig/dump (word-count data)))
#{["moon" 1] ["jumped" 2] ["dog" 1] ["over" 2] ["cow" 1] ["fox" 1] ["the" 4]}
Note: `set` is used because the ordering of the resulting data is not deterministic.
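One caveat with comparing sets is that duplicate rows collapse. When duplicates matter, comparing `frequencies` is one order-independent alternative. The sketch below uses plain vectors as stand-ins for `pig/dump` output, and the rows are made up.

```clojure
;; Plain vectors standing in for pig/dump output from different runs.
(def run-a [["the" 4] ["fox" 1] ["fox" 1]])
(def run-b [["fox" 1] ["the" 4] ["fox" 1]])
(def run-c [["fox" 1] ["the" 4]])

;; A set ignores order but also hides the duplicate ["fox" 1] row.
(def same-as-sets (= (set run-a) (set run-c)))

;; frequencies ignores order and still counts duplicates.
(def same-rows (= (frequencies run-a) (frequencies run-b)))
(def same-rows-without-dup (= (frequencies run-a) (frequencies run-c)))
```

Here `same-as-sets` and `same-rows` are true, while `same-rows-without-dup` is false because `run-c` is missing one of the duplicate rows.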
Getting started with Clojure and PigPen is really easy.
- The tutorial is the best way to get Clojure and PigPen installed and start writing queries
- The full API lists all of the operators with example usage
- PigPen for Clojure users is great for Clojure users new to map-reduce
- PigPen for Pig users is great for Pig users new to Clojure
- PigPen for Cascading users is great for Cascading users new to Clojure