Tutorial

Getting started with Clojure and PigPen is really easy. Just follow the steps below to get up and running.

Install Leiningen
Create a new Leiningen project with `lein new pigpen-demo`. This will create a `pigpen-demo` folder for your project.
a. To use Pig, add PigPen as a dependency by changing the dependencies in your project's `project.clj` file to look like this:

``` clojure
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [com.netflix.pigpen/pigpen-pig "0.3.3"]]
  :profiles {:dev {:dependencies [[org.apache.pig/pig "0.13.0"]
                                  [org.apache.hadoop/hadoop-core "1.1.2"]]}}
```

 b. To use Cascading, add PigPen as a dependency by changing the dependencies in your project's `project.clj` file to look like this:

``` clojure
  :repositories [["conjars" "http://conjars.org/repo"]]
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [com.netflix.pigpen/pigpen-cascading "0.3.1"]]
  :profiles {:dev {:dependencies [[org.apache.hadoop/hadoop-core "1.1.2"]]}}
```

Run `lein repl` to start a REPL for your new project.
Try some samples below...

If you have any questions, or if something doesn't look quite right, contact us here: pigpen-support@googlegroups.com

Note: It is strongly recommended to familiarize yourself with Clojure before using PigPen.

Note: PigPen requires Clojure 1.5.1 or greater. The Leiningen example uses Leiningen 2.0 or greater.

To get started, we import the `pigpen.core` namespace:

(require '[pigpen.core :as pig])

First, let's load some data. Text files (tsv, csv) can be read using `pig/load-tsv`. If you have Clojure data, take a look at `pig/load-clj`.

The following code defines a function that returns a query. This query loads data from the file `input.tsv`.

(defn my-data []
  (pig/load-tsv "input.tsv"))

Note: If you call this function, it will just return the PigPen representation of a query. To really use it, you'll need to execute it locally or convert it to a script (more on that later).

We can test our query in a REPL like so... First, create some test data:

=> (spit "input.tsv" "1\t2\tfoo\n4\t5\tbar")

And then run the script to return our data:

=> (pig/dump (my-data))
[["1" "2" "foo"] ["4" "5" "bar"]]
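Each row comes back as a vector of strings, one element per tab-separated field. As a rough illustration of that shape only (plain Clojure, not how PigPen reads files internally), the same test data can be split by hand. The helper name `split-tsv` is hypothetical and introduced just for this sketch:

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not part of PigPen): split raw TSV text
;; into a vector of rows, each row a vector of string fields.
(defn split-tsv [text]
  (mapv #(str/split % #"\t")
        (str/split-lines text)))

(split-tsv "1\t2\tfoo\n4\t5\tbar")
;; => [["1" "2" "foo"] ["4" "5" "bar"]]
```

Note that the numeric fields are still strings at this point, which is why the transformation that follows parses them with `Integer/valueOf`.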

Now let's transform our data:

(defn my-data-1 []
  (->>
    (pig/load-tsv "input.tsv")
    (pig/map (fn [[a b c]]
               {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
                :name c}))))

If we run the script now, our output data reflects the transformation:

=> (pig/dump (my-data-1))
[{:sum 3, :name "foo"} {:sum 9, :name "bar"}]
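The function passed to `pig/map` is an ordinary Clojure function, so one way to check its logic is to give it a name and apply it to plain vectors with `clojure.core/map`, without involving PigPen at all. A minimal sketch, where `row->record` is a name introduced here for illustration:

```clojure
;; Same logic as the anonymous function in my-data-1,
;; under a hypothetical name so it can be called directly.
(defn row->record [[a b c]]
  {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
   :name c})

;; Plain clojure.core/map over in-memory rows; no PigPen involved.
(map row->record [["1" "2" "foo"] ["4" "5" "bar"]])
;; => ({:sum 3, :name "foo"} {:sum 9, :name "bar"})
```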

And we can filter the data too:

(defn my-data-2 []
  (->>
    (pig/load-tsv "input.tsv")
    (pig/map (fn [[a b c]]
               {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
                :name c}))
    (pig/filter (fn [{:keys [sum]}]
                  (< sum 5)))))

=> (pig/dump (my-data-2))
[{:sum 3, :name "foo"}]

It's generally a good practice to separate the loading of the data from our business logic. Let's separate our script into multiple functions and add a store operator:

(defn my-data-3 [input-file]
  (pig/load-tsv input-file))

(defn my-func [data]
  (->> data
    (pig/map (fn [[a b c]]
               {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
                :name c}))
    (pig/filter (fn [{:keys [sum]}]
                  (< sum 5)))))

(defn my-query [input-file output-file]
  (->>
    (my-data-3 input-file)
    (my-func)
    (pig/store-clj output-file)))

Now we can define a unit test for our query:

(use 'clojure.test)

(deftest test-my-func
  (let [data (pig/return [["1" "2" "foo"] ["4" "5" "bar"]])]
    (is (= (pig/dump (my-func data))
           [{:sum 3, :name "foo"}]))))

The function `pig/dump` takes any PigPen query, executes it locally, and returns the data.

If we want to generate a script, that's easy too:

(require '[pigpen.pig])

(pigpen.pig/write-script "my-script.pig" (my-query "input.tsv" "output.clj"))

We can optionally run our script locally in Pig, if you have it installed (Pig is not a requirement of PigPen). The easiest way to build the PigPen jar is to build an uberjar for our project and copy it to `pigpen.jar`. From the command line:

$ lein uberjar
$ cp target/pigpen-demo-0.1.0-SNAPSHOT-standalone.jar pigpen.jar
$ pig -x local -f my-script.pig
$ cat output.clj/part-m-00000
{:sum 3, :name "foo"}

Note: Pig can't overwrite files, so you'll need to delete the `output.clj` folder before running the script again. Another recommended option is to put a timestamp in the output path.
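A timestamped path can be built in plain Clojure before it is handed to the query. A minimal sketch, assuming a millisecond timestamp suffix is acceptable for your setup; `timestamped-path` is a hypothetical helper, not part of PigPen:

```clojure
;; Hypothetical helper: append the current time in milliseconds
;; to a base path so each run writes to a different location.
(defn timestamped-path [base]
  (str base "-" (System/currentTimeMillis)))

(timestamped-path "output")
;; returns a string such as "output-<current time in milliseconds>"
```

The result could then be passed as the `output-file` argument of `my-query`.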

See PigPen for Cascading users for how to convert a PigPen query into a Cascading flow.

A Netflix Original Production
