Use cascading-hadoop2-mr1 by default #160
Comments
I'm fine with upgrading. It's using Hadoop 1 because that's what existed when I originally wrote PigPen, and it just got copied around. On the Pig side, it's compiled as a provided dependency, so you can really use whatever you want on the cluster. Would it make sense to change the reference in pigpen-cascading to cascading-hadoop2-mr1?
Also, I imagine that targeting Hadoop 2 isn't going to break backwards compatibility for users still running a Hadoop 1 cluster?
Cascading should be in the jar you submit; Hadoop is meant to be set to provided.
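To make that concrete, here is a minimal sketch of the split in Leiningen syntax; the coordinates and versions are illustrative assumptions, not the project's actual build (which may use a different tool entirely):

```clojure
;; Sketch only: Cascading ships inside the job jar, Hadoop is provided by the cluster.
;; Versions and the pigpen-cascading coordinate below are assumptions for illustration.
(defproject my-pigpen-job "0.1.0-SNAPSHOT"
  :dependencies [[com.netflix.pigpen/pigpen-cascading "0.3.0"]   ; hypothetical version
                 [cascading/cascading-hadoop2-mr1 "2.7.1"]]      ; goes into the uberjar
  ;; Leiningen's :provided profile is on the compile classpath but excluded from
  ;; the uberjar, mirroring Maven's provided scope.
  :profiles {:provided {:dependencies [[org.apache.hadoop/hadoop-client "2.6.0"]]}})
```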
Cool - I can make that change. I'll update the Pig version at the same time. Is there a specific version you'd like to see?
The thing is a little bit more complicated: Cascading has a notion of platforms on which it can run. Each platform is the abstraction of a runtime; it has its own FlowConnector and so on. For Cascading 2.x, we have three platforms: cascading-local, cascading-hadoop (Hadoop 1.x), and cascading-hadoop2-mr1 (Hadoop 2.x, MR1).
In Cascading 3, we will have more platforms, notably Apache Tez.
Others are in the works, from what I know. The thing is that those platforms are mutually exclusive. While you can move a Cascading app onto a new platform by changing the FlowConnector, they cannot all be on the classpath at once; a user has to ship with one. To make a long answer short, I would move to cascading-hadoop2-mr1 (compile scope) on Hadoop 2.6 (provided scope), and if somebody wants to move to a new platform, they can do the excludes manually. Does that make sense?
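As a rough illustration of what "changing the FlowConnector" means, here is a hedged Clojure sketch using Cascading 2.x class names; it is not how PigPen actually constructs its flows:

```clojure
;; Sketch only: the platform determines which FlowConnector you instantiate,
;; and each connector class lives in a different, mutually exclusive platform jar.
(import '[cascading.flow.hadoop2 Hadoop2MR1FlowConnector]) ; from cascading-hadoop2-mr1
;; (import '[cascading.flow.hadoop HadoopFlowConnector])   ; cascading-hadoop (Hadoop 1.x) instead
;; (import '[cascading.flow.local LocalFlowConnector])     ; cascading-local instead

(defn make-flow-connector
  "Builds the Hadoop 2 / MR1 flow connector from a map of job properties."
  [^java.util.Map props]
  (Hadoop2MR1FlowConnector. props))
```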
I think so... Here's what I'm thinking; let me know if it's crazy. In the PigPen build, I'll mark both the Hadoop and Cascading dependencies as provided. We'll then suggest in the tutorial that users choose the Cascading version/platform they want (as a compile dep) and the Hadoop version they want (as a provided dep). Do you think that'll work? That way they can pick whichever versions of Cascading and Hadoop they like, and it's up to them to make sure it ends up in the uberjar they ship to Hadoop.
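Under that proposal, the library side might look roughly like this (again in Leiningen syntax with hypothetical versions; the real build may differ):

```clojure
;; Hypothetical sketch: pigpen-cascading declares both deps as provided,
;; so each consumer picks the platform jar and the Hadoop version themselves.
(defproject com.netflix.pigpen/pigpen-cascading "0.3.0"
  :dependencies [[cascading/cascading-hadoop2-mr1 "2.7.1" :scope "provided"]
                 [org.apache.hadoop/hadoop-client "2.6.0" :scope "provided"]])
```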
It makes sense to do it that way only if you use Cascading APIs exclusively. While most platforms have Hadoop ties, it is perfectly possible to have a platform without Hadoop at all; our local platform is one example. In that case the idea does not work. Maybe the safest thing to do is to ship a direct dependency on cascading-hadoop2-mr1 by default and let experts use it on other platforms if they desire. That makes it easy for people to get started, and we can then take it from there with the community.
Ahhh - I think I get it. Making Hadoop a provided dependency isn't great because some platforms don't use Hadoop at all, so if those jars aren't present, they can't run, because our code uses Hadoop. It's not so much a question of which Hadoop to use; it's more a question of to Hadoop or not to Hadoop. I can think of two ways around this...
My preference is for option 2. I think it's simpler and easier to maintain long term. I feel like the fewer stakes we put in the ground regarding versions and libraries, the better. @pkozikow I'd love to hear your thoughts on this. Do you have any concerns with that change?

I think there might be one small hiccup to this plan, though: it looks like the default taps for reading/writing strings use some Hadoop-related classes: https://github.com/Netflix/PigPen/blob/master/pigpen-cascading/src/main/clojure/pigpen/cascading/core.clj#L94 Does this imply that taps are tied to the platform being used? Would different platforms use different taps?

And of course there is option 3: stick with the current approach and hard-wire it to use Hadoop 2. It is an option, but I'd rather fix this the right way now, before people start using it. Once people start using it, it's a lot harder to change stuff like this.
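For context, here is a minimal sketch of why those default taps tie a flow to a platform; the class names are Cascading's own, but the wiring is illustrative rather than PigPen's actual code:

```clojure
;; Illustrative only: the Hadoop platform's text tap vs. the local platform's.
;; Each Scheme/Tap pair comes from a different platform jar, so a flow built with
;; Hfs + the hadoop TextLine can only run on a Hadoop-based platform.
(import '[cascading.scheme.hadoop TextLine]
        '[cascading.tap.hadoop Hfs])

(defn hadoop-text-tap
  "Tap that reads/writes plain text lines on HDFS (or any Hadoop filesystem)."
  [path]
  (Hfs. (TextLine.) path))

;; The local-platform equivalent would use cascading.scheme.local.TextLine and
;; cascading.tap.local.FileTap instead: different classes, same concept.
```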
According to Cascading's documentation, different platforms do indeed use different taps: "... using platform-specific Tap and Scheme classes that hide the platform-related I/O details from the developer". This being the case, I don't see a nice way of implementing all the built-in PigPen load/store functions without using platform-specific code under the hood when building the Cascading flow. It might be possible to do some sort of dynamic class loading at runtime to detect the platform and use the appropriate taps, but that sounds rather ugly and complicated. Besides, I would say that, at least in the short term, by and large everyone will want to use the Hadoop platform. The loss of the local platform is not significant, since PigPen has its own local mode anyway. That would mean shipping a direct dependency on Hadoop and letting experts customize with excludes manually, as Andre suggested.
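For what it's worth, the "detect the platform at runtime" idea could look something like the following; this is purely a hypothetical sketch of the dismissed approach, not anything PigPen does:

```clojure
;; Hypothetical: probe the classpath for a platform jar and pick a matching Scheme.
(defn platform-available?
  "True if the named class can be loaded from the current classpath."
  [class-name]
  (try (Class/forName class-name) true
       (catch ClassNotFoundException _ false)))

(defn default-text-scheme
  "Picks a TextLine Scheme based on which Cascading platform jar is present."
  []
  (cond
    (platform-available? "cascading.scheme.hadoop.TextLine")
    (.newInstance (Class/forName "cascading.scheme.hadoop.TextLine"))

    (platform-available? "cascading.scheme.local.TextLine")
    (.newInstance (Class/forName "cascading.scheme.local.TextLine"))

    :else
    (throw (IllegalStateException. "No known Cascading platform on the classpath"))))
```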
Fair enough - option 3 is the easier change :) I'm still having problems with Hadoop 2, but I'll follow up on the other thread.
Cascading supports multiple backends, and we recommend the Hadoop 2 (YARN) platform for everyone these days. It would be great if pigpen-cascading could move to that. Let me know if you want a PR for that, or if something forces you to stay on Hadoop 1.x.