-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CUBE/ROLLUP in PigPen #134
Comments
Glad you're enjoying it! I just read about cube/rollup this morning, but it seems pretty straightforward. From how I read it, it looks like it's just a mapcat operation to expand a group key into a sentinel version of the key:
Then you'd just group-by that set of keys instead of the original one. Here's a plain clojure example I was playing with: (defn cube []
(let [sentinel '*
cube-key (fn [& fns]
(let [key-fn (apply juxt fns)]
(fn [value]
(let [key (key-fn value)]
(cons
[key value]
(for [i (range (count key))]
[(assoc key i sentinel) value]))))))]
(->> [{:a 1, :b 2, :c 3}
{:a 1, :b 2, :c 4}
{:a 2, :b 3, :c 5}]
(mapcat (cube-key :a :b))
(group-by first)
(map (fn [[k vs]] [k (map second vs)]))))) To make this run in pigpen today, just add And then the rollup version seems to be just a different function of the grouping key: (defn rollup []
(let [sentinel '*
rollup-key (fn [& fns]
(let [key-fn (apply juxt fns)]
(fn [value]
(let [key (key-fn value)]
(reductions
(fn [[k v] i]
[(assoc k (dec i) '*) v])
[key value]
(range (count key) 0 -1))))))]
(->> [{:a 1, :b 2, :c 3}
{:a 1, :b 2, :c 4}
{:a 2, :b 3, :c 5}]
(mapcat (rollup-key :a :b))
(group-by first)
(map (fn [[k vs]] [k (map second vs)]))))) So, that's what we want to do, now how & where do we implement this? We could implement these to use Pig's CUBE/ROLLUP operators, but I doubt that they're doing anything special and it would block us from using these in other platforms. Since we can trivially implement these using the existing operators (and make them more flexible), let's stick with that for now. I can follow up & make sure that pig isn't doing any additional optimizations here that we would miss out on. We could add cube-by and rollup-by operators as described above, which are just compositions of existing operators, but I generally try to stay away from adding a lot of new, special purpose operators that aren't in I think a better generalization would be a Then we'd end up with a commands like this:
That said, you were right when you alluded to a different platform - we are about to release a cascading version of pigpen. This is probably the kick in the butt I need to finish that work. :) Of course, the remaining changes I have to make to that branch would most certainly break whatever work you would do to add these new operators - go figure. Apologies for the long-winded ness of this response, but if you're still interested, take a look at join.clj where the existing Until then, the first example described above would certainly work as-is today. That would get you started without having to change anything. Thoughts? Thanks, |
Wow, thank you for so detailed and fast response! You're right that it's simple to imitate that same behavior, just wanted to make sure I wasn't going to reinvent the wrong wheel. I'll try to implement cube/rollup solution sometime this week and will get back to you with the results. Thanks again! |
This project just blew my mind! I'm super excited to find it, but.. Are there any plans to introduce
CUBE
/ROLLUP
functions? Are these on a roadmap?I'd be happy to contribute, just that I'm not sure yet whether it won't contradict any future plans (i.e. migration to other backend; or this will be separate project?)
Thank you!
The text was updated successfully, but these errors were encountered: