Subqueries

Matt Bossenbroek edited this page Dec 16, 2014 · 2 revisions

Pig has a concept of subqueries, which allow for processing inner bags (the result of grouping) directly within the script. PigPen supports subqueries, but it doesn't use Pig's nested inner bag syntax to execute them. In fact, most of the time, you won't even notice the difference.

Say we want to find the row with the greatest timestamp for each distinct value of the first column:

(require '[clojure.test :refer :all]
         '[pigpen.core :as pig]
         '[pigpen.fold :as fold])

(deftest test-subquery
  (let [command (->>
                  (pig/return [["A" 10 20110101235900]
                               ["A" 11 20110101235959]
                               ["A" 12 20110101230059]
                               ["B" 20 20110201010000]
                               ["B" 21 20110202010000]
                               ["C" 30 20110301030000]])
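                  ;; Group by the first column and fold each group down to
                  ;; the row with the greatest timestamp (the last column).
                  (pig/group-by first
                    {:fold (fold/max-key last)})
                  ;; Each result is [key max-row]; keep only the row.
                  (pig/map (fn [[_ max-row]] max-row)))]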
    (is (= (set (pig/dump command))
           #{["A" 11 20110101235959]
             ["B" 21 20110202010000]
             ["C" 30 20110301030000]}))))

This folds each group with fold/max-key, which computes the row with the highest timestamp. Most of the work therefore happens in the mappers: each one computes the max for each of its groups, leaving the reducers with much less work to do.
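
To see why a max fold can run mostly in the mappers, here is a rough simulation in plain Clojure. It does not use PigPen or Hadoop. The names `later-row`, `partial-max`, and `combine-max` and the split into three partitions are invented for illustration, and the sketch is not a description of how PigPen implements folds internally. Taking a max is associative, so each partition can reduce its own rows first and only the small partial results need to be merged.

```clojure
;; Illustrative only: simulates mapper-side partial aggregation in plain Clojure.
(def sample-rows
  [["A" 10 20110101235900]
   ["A" 11 20110101235959]
   ["A" 12 20110101230059]
   ["B" 20 20110201010000]
   ["B" 21 20110202010000]
   ["C" 30 20110301030000]])

(defn later-row
  "Returns whichever of two rows has the greater timestamp (last column)."
  [a b]
  (max-key last a b))

(defn partial-max
  "Mapper-side step (hypothetical): within one partition, keep only the
   latest row for each key."
  [partition]
  (reduce (fn [acc row]
            (merge-with later-row acc {(first row) row}))
          {}
          partition))

(defn combine-max
  "Reducer-side step (hypothetical): merge the per-partition maps, again
   keeping the latest row for each key."
  [partials]
  (apply merge-with later-row partials))

;; Pretend the input was split across three mappers, two rows each.
;; Key \"A\" ends up in two different partitions.
(def partitions (partition-all 2 sample-rows))

(def latest-rows
  (set (vals (combine-max (map partial-max partitions)))))
```

In this sketch `latest-rows` holds the same three rows the test above expects, even though the rows for "A" are spread over two partitions.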

As for Pig's nested inner bag syntax, it really isn't buying us anything: it's just syntactic sugar in Pig. There's no performance advantage over simply writing a Clojure function that PigPen consumes via a Pig UDF. What you can do in a nested block is very limited, and for interesting queries, you often have to fall back to using UDFs anyway. Also, the Pig scripts generated by PigPen aren't intended to be edited or maintained by humans, so using a slightly different syntax to accomplish the same task didn't seem advantageous.

By contrast, since PigPen does everything in a UDF, your function can do anything. Any Clojure function is fair game, which makes PigPen much more flexible than Pig's inner bag syntax. You can use the full power of Clojure anywhere in your script.
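
As a small illustration of "any Clojure function", the sketch below defines an ordinary function over the rows of one group and tries it out locally with `clojure.core/group-by`. The helper `top-n-by-timestamp` is a made-up name, not part of PigPen. In a PigPen script, a function like this could presumably be called from a `pig/map` that follows a `pig/group-by`, on the assumption that each grouped element arrives as a `[key values]` pair.

```clojure
;; Illustrative only: plain Clojure, no PigPen dependency.
(def grouped-input
  [["A" 10 20110101235900]
   ["A" 11 20110101235959]
   ["A" 12 20110101230059]
   ["B" 20 20110201010000]
   ["B" 21 20110202010000]
   ["C" 30 20110301030000]])

(defn top-n-by-timestamp
  "Hypothetical helper: the n most recent rows of a group, newest first."
  [n rows]
  (->> rows
       (sort-by last >)   ; sort descending by the timestamp column
       (take n)
       vec))

;; Local stand-in for group-by followed by map: key -> two latest rows.
(def top-two
  (->> grouped-input
       (group-by first)
       (map (fn [[k rows]] [k (top-n-by-timestamp 2 rows)]))
       (into {})))
```

Because the helper is a plain function, it can be tested at the REPL on its own before it goes into a script.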