Tutorial
Getting started with Clojure and PigPen is really easy. Just follow the steps below to get up and running.
- Install Leiningen
- Create a new Leiningen project with `lein new pigpen-demo`. This will create a pigpen-demo folder for your project.
- a. To use Pig, add PigPen as a dependency by changing the dependencies in your project's `project.clj` file to look like this:
``` clojure
:dependencies [[org.clojure/clojure "1.6.0"]
[com.netflix.pigpen/pigpen-pig "0.3.3"]]
:profiles {:dev {:dependencies [[org.apache.pig/pig "0.13.0"]
[org.apache.hadoop/hadoop-core "1.1.2"]]}}
```
b. To use Cascading, add PigPen as a dependency by changing the dependencies in your project's `project.clj` file to look like this:
``` clojure
:repositories [["conjars" "http://conjars.org/repo"]]
:dependencies [[org.clojure/clojure "1.6.0"]
[com.netflix.pigpen/pigpen-cascading "0.3.1"]]
:profiles {:dev {:dependencies [[org.apache.hadoop/hadoop-core "1.1.2"]]}}
```
- Run `lein repl` to start a REPL for your new project.
- Try some samples below...
If you have any questions, or if something doesn't look quite right, contact us here: pigpen-support@googlegroups.com
Note: It is strongly recommended to familiarize yourself with Clojure before using PigPen.
Note: PigPen requires Clojure 1.5.1 or greater. The Leiningen example uses Leiningen 2.0 or greater.
To get started, we import the pigpen.core namespace:
(require '[pigpen.core :as pig])
First, let's load some data. Text files (tsv, csv) can be read using `pig/load-tsv`. If you have Clojure data, take a look at `pig/load-clj`.
The following code defines a function that returns a query. This query loads data from the file input.tsv.
(defn my-data []
  (pig/load-tsv "input.tsv"))
Note: If you call this function, it will just return the PigPen representation of a query. To really use it, you'll need to execute it locally or convert it to a script (more on that later).
We can test our query in a REPL like so... First, create some test data:
=> (spit "input.tsv" "1\t2\tfoo\n4\t5\tbar")
And then run the script to return our data:
=> (pig/dump (my-data))
[["1" "2" "foo"] ["4" "5" "bar"]]
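Note that every field comes back as a string. As a rough mental model only (this is not PigPen's implementation, and `parse-tsv` is a hypothetical name), the result above matches what you would get by splitting each line on tabs in plain Clojure:

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper, for illustration only: split the text into lines,
;; then split each line on tab characters. All fields stay strings.
(defn parse-tsv [text]
  (mapv #(string/split % #"\t") (string/split-lines text)))

(parse-tsv "1\t2\tfoo\n4\t5\tbar")
;; => [["1" "2" "foo"] ["4" "5" "bar"]]
```

This is why the transformations that follow convert the numeric fields explicitly before doing arithmetic on them.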
Now let's transform our data:
(defn my-data-1 []
  (->>
    (pig/load-tsv "input.tsv")
    (pig/map (fn [[a b c]]
               {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
                :name c}))))
If we run the script now, our output data reflects the transformation:
=> (pig/dump (my-data-1))
[{:sum 3, :name "foo"} {:sum 9, :name "bar"}]
And we can filter the data too:
(defn my-data-2 []
  (->>
    (pig/load-tsv "input.tsv")
    (pig/map (fn [[a b c]]
               {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
                :name c}))
    (pig/filter (fn [{:keys [sum]}]
                  (< sum 5)))))
=> (pig/dump (my-data-2))
[{:sum 3, :name "foo"}]
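The functions handed to `pig/map` and `pig/filter` are ordinary Clojure functions, so their logic can be checked on plain collections without running a query. The sketch below is one way to do that; the names `row->record`, `small-sum?`, and `rows` are hypothetical, and plain `map` and `filter` from `clojure.core` stand in for the PigPen operators:

```clojure
;; Hypothetical names; the bodies match the anonymous functions used above.
(defn row->record
  "Turns one parsed TSV row of strings into a map."
  [[a b c]]
  {:sum  (+ (Integer/valueOf a) (Integer/valueOf b))
   :name c})

(defn small-sum?
  "True when the record's :sum is below 5."
  [{:keys [sum]}]
  (< sum 5))

;; A plain vector stands in for the loaded data here.
(def rows [["1" "2" "foo"] ["4" "5" "bar"]])

(->> rows
     (map row->record)
     (filter small-sum?))
;; => ({:sum 3, :name "foo"})
```

Naming the functions this way is a design choice rather than a requirement: it keeps the per-row logic testable on its own, separate from how the data is loaded.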
It's generally a good practice to separate the loading of the data from our business logic. Let's separate our script into multiple functions and add a store operator:
(defn my-data-3 [input-file]
  (pig/load-tsv input-file))

(defn my-func [data]
  (->> data
    (pig/map (fn [[a b c]]
               {:sum (+ (Integer/valueOf a) (Integer/valueOf b))
                :name c}))
    (pig/filter (fn [{:keys [sum]}]
                  (< sum 5)))))

(defn my-query [input-file output-file]
  (->>
    (my-data-3 input-file)
    (my-func)
    (pig/store-clj output-file)))
Now we can define a unit test for our query:
(use 'clojure.test)
(deftest test-my-func
  (let [data (pig/return [["1" "2" "foo"] ["4" "5" "bar"]])]
    (is (= (pig/dump (my-func data))
           [{:sum 3, :name "foo"}]))))
The function `pig/dump` takes any PigPen query, executes it locally, and returns the data.
If we want to generate a script, that's easy too:
(require '[pigpen.pig])
(pigpen.pig/write-script "my-script.pig" (my-query "input.tsv" "output.clj"))
We can optionally run our script locally in Pig (if you have it installed, which is not a requirement of PigPen). The script expects the PigPen code in a jar named pigpen.jar, and the easiest way to build that jar is to build an uberjar for our project. From the command line:
$ lein uberjar
$ cp target/pigpen-demo-0.1.0-SNAPSHOT-standalone.jar pigpen.jar
$ pig -x local -f my-script.pig
$ cat output.clj/part-m-00000
{:sum 3, :name "foo"}
Note: Pig can't overwrite files, so you'll need to delete the output.clj folder before running the script again. Another recommended option is to put a timestamp in the output path.
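One way to put a timestamp in the path is sketched below. `timestamped-path` is a hypothetical helper, not part of PigPen, and the choice of epoch milliseconds as the suffix is an assumption; any value that is unique per run would do.

```clojure
;; Hypothetical helper: appends the current time in epoch milliseconds
;; to a base path, so each run writes to a fresh location.
(defn timestamped-path [base]
  (str base "-" (System/currentTimeMillis)))

(timestamped-path "output.clj")
;; e.g. "output.clj-1400000000000" (the suffix differs on every call)

;; With the my-query function defined earlier, the call would look like:
;; (my-query "input.tsv" (timestamped-path "output.clj"))
```

Keep in mind that the output folder name then changes on every run, so the `cat` command above needs the matching path.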
See PigPen for Cascading users for how to convert a PigPen query into a Cascading flow.