new hadoop_jvm module for java/scala hadoop jobs. #9
Conversation
* new JvmJobTask and extracted some common functionality from HadoopJobTask into BaseHadoopJobTask
* moved jobconf logic into the BaseHadoopJobTask
* bare-bones JvmHadoopJobRunner and extracted common functionality from HadoopJobRunner into static methods (submitting/tracking a job).
Looks good, I'll take a closer look tomorrow. I definitely think it makes a lot of sense to have alternatives to Hadoop Streaming in Luigi proper, so thanks a lot for the pull request. I have some code for running jars, but it's pretty ugly, so I didn't want to add it to Luigi yet. I attached it below just to have something to compare to.

Btw: what is the purpose of the NamedHdfsTarget class?

Btw 2: one thing that I would suggest is to make the output target atomic. The code below achieves that in an ugly way. Not sure what's best.

Btw 3: Is there a specific reason you are using inheritance, as opposed to just putting common code in a function or static method?
import subprocess
import sys, os
import luigi, spotify.luigi


class JavaTask(luigi.Task):

    def run(self):
        def fix_path(x, t):
            # Inputs/outputs that are HdfsTargets: existing ones are passed
            # through as paths; not-yet-existing ones (outputs) are swapped
            # for temporary targets that get moved into place afterwards.
            if isinstance(x, spotify.luigi.HdfsTarget):  # input/output
                if x.exists():
                    return x.path
                else:
                    y = spotify.luigi.HdfsTarget(is_tmp=True)
                    t.append((y, x))
                    return y.path
            else:
                return str(x)

        tmp_files = []
        args = [fix_path(x, tmp_files) for x in self.args()]

        # Look for the jar locally or in the shared install directory.
        jar = self.jar()
        for d in ['.', '/usr/share/spotify-rec-sys-java']:
            f = os.path.join(d, jar)
            if os.path.exists(f):
                jar = f
                break
        else:
            raise Exception('no jar found')

        java_class = self.java_class()
        cmd = ' '.join(['hadoop', 'jar', jar, java_class] + args)
        print cmd

        process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT, shell=True)
        for line in process.stdout:
            print >> sys.stderr, '---', line.strip()
        if process.wait():  # wait for termination
            raise Exception('java process returned with nonzero status')

        # Job succeeded: move temporary outputs to their final locations.
        for a, b in tmp_files:
            a.move(b)

    def jar(self):
        return 'rec-sys-java-1.0-SNAPSHOT.jar'
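To make the snippet concrete, a subclass might look roughly like this (a minimal sketch; the task name, upstream dependency, paths and Java class are all made up for illustration):

# Hypothetical subclass of the JavaTask sketch above; all names are illustrative.
class SimilarityJob(JavaTask):
    date = luigi.DateParameter()

    def requires(self):
        return LogData(self.date)  # assumed upstream task producing an HdfsTarget

    def output(self):
        return spotify.luigi.HdfsTarget('/output/similarity/%s' % self.date)  # made-up path

    def java_class(self):
        return 'com.example.SimilarityJob'  # made-up main class

    def args(self):
        # the output doesn't exist yet, so fix_path() in run() swaps it for a
        # temporary target and moves it into place once the job succeeds
        return [self.input(), self.output()]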
Cool, good to hear. We also use Pig and Hive, and I'd like to add support for those if that sounds good.
Nice, thanks for sharing.
D'oh, I went through a lot of iterations of the code and forgot to clean some things up before sending, sorry! NamedHdfsTarget is from one of those old iterations and can be removed (the idea was to use it as a named argument to the MR job, e.g.
It's a bit tricky to do this in a generic way, since many of our jobs just expect a parameter that's an output directory. Some even write to multiple directories. I'll have to think about it some more. FWIW, we tend to use
I think this is the major design decision. I can see it going three ways:
My current implementation is a start at 2, but it's not complete... it's kind of a 1&2 hybrid. Just to add some context, I come from a Java background where inheritance is heavily used, and I'm not super familiar with the approaches you use throughout the codebase, so I'm really eager to hear your thoughts.
2013/1/6 Joe Crobak notifications@github.com
We have been running a bunch of stuff on S3 too. One thing you can do is to […] A design goal for Luigi has been to try to make things atomic unless it's […]
I don't have strong feelings about any of those options. I think […] Honestly I think Luigi in the current state could remove some of the […] Btw, when you say meta class, do you mean having a bunch of static methods like […]? I'll take another look at your patch set later today!
I talked to Elias Freider (the other maintainer, and usually full of OOP wisdom). He thought your patch looked good, so that's great. From my perspective the remaining question is whether we can make the task atomic.
Sounds good re. atomic operations. The only caveat is that we need to have the temp output location on the same filesystem as our output (i.e. not /tmp), since you can't rename across filesystems. Let me put together some cleanups (i.e. remove NamedHdfsTarget). Re. meta, I was thinking of a proper class so that you could have inheritance to keep things DRY.
2013/1/7 Joe Crobak notifications@github.com
It might be hard to figure out a clean way to do atomic file operations.
Oops, what I meant was that this is in order to keep it in the same filesystem :)
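For reference, the pattern being discussed is roughly the following (a minimal sketch reusing HdfsTarget(is_tmp=True) and move() from the snippet earlier in the thread; run_hadoop_job and the output path are made up):

# Sketch of write-to-temp-then-rename. The temp target lives on HDFS rather
# than local /tmp, so the final move is a same-filesystem rename and is
# therefore effectively atomic.
tmp = spotify.luigi.HdfsTarget(is_tmp=True)          # temp dir on HDFS
final = spotify.luigi.HdfsTarget('/output/my-job')   # made-up final path

run_hadoop_job(output_dir=tmp.path)  # hypothetical: the job writes to the temp dir
tmp.move(final)                      # rename into place only after the job succeeds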
* renames hadoop_jvm -> hadoop_jar, and JvmHadoop* to HadoopJar*.
* remove NamedHdfsTarget
* New example for terasort/teragen.
Here's an updated PR with some examples for discussion (also hoping for feedback on those -- curious how you guys propagate params like that). One way I can envision doing the atomic operations is to use something like the fix_path trick from your JavaTask code snippet above. The other option would be to have some other method that parallels output().
I thought about it a bit more. In the example I attached above:
You could also replace the output() method as you pointed out, but I think the problem still persists because you need to either inspect the type of it, or add some generic method on a Target level that returns a "tempified" version. Another alternative is to assume that the output() method always returns HdfsTarget types, which might be a reasonable approach.

I think the same thing applies to having an atomic_output() method as well, with the caveat that the code isn't atomic if the user forgets to use atomic_output() instead of output(). OTOH we wouldn't have to replace output() at runtime, which is a great benefit. I'll check tomorrow with some coworkers to see what they think about these options. I probably seem crazy for being so concerned about atomicity, but I've seen non-atomic code lead to really bad problems in the past. Like realizing a month later that we are missing lots of data :)

Btw, your example looks great! It's a little nonstandard since it takes raw paths as parameters. I actually never realized you can do that, probably because all our tasks map to specific paths in HDFS and I even do that for ad hoc jobs. But I like the fact that you do it that way -- it shows that there's some flexibility.

One thing that may not be obvious is that there is a convenience method called input() on the Task class. I.e. instead of

return [self.requires().output().path, self.output().path]

you can do

return [self.input().path, self.output().path]

And just to be clear, in the jar code example I attached above you can also drop the .path if you want:

return [self.input(), self.output()]
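Purely for illustration, the atomic_output() alternative could look something like this (the method names and everything else here are assumptions, not an actual Luigi API):

# Hypothetical sketch of the atomic_output() idea. The job writes to a
# temporary HdfsTarget, and the output is renamed into place only after the
# task finishes, so partial output is never visible. The caveat from the
# discussion applies: nothing is atomic if a task uses output() directly.
class AtomicOutputTask(luigi.Task):

    def atomic_output(self):
        # lazily create one temp target standing in for the real output
        if not hasattr(self, '_tmp_output'):
            self._tmp_output = spotify.luigi.HdfsTarget(is_tmp=True)
        return self._tmp_output

    def commit_output(self):
        # same-filesystem rename of the temp dir onto the final path,
        # to be called once run() has completed successfully
        self.atomic_output().move(self.output())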
Sorry for being late to the discussion. Awesome to have you here, jcrobak! I might have missed something, but in order to achieve some sort of atomicity on this task (as well as "improved" atomicity on streaming tasks), I suggest either adding another flag to the HdfsTarget constructor, HdfsTarget(path, with_success_file=False), or subclassing HdfsTarget to something like MapRedTarget, both with the change that they check the existence of the _SUCCESS flag file in the exists() method. Then we would also have to add something in the run() method to remove any existing outputs if they don't contain the _SUCCESS file (indicating a previously failed attempt that hasn't been cleaned up). What do you think? I might have missed something here.

As for the inheritance, I think a base class for MapReduce jobs that contains overridable common methods, variables and access to Hadoop configuration would be nice, and your patch is a good step in that direction. On top of that we could have HadoopJar, HadoopStreaming, HiveJob and PigJob classes that make use of the same common attributes.
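A minimal sketch of the subclass variant (MapRedTarget is the name from the comment above; the import path and the exact check are assumptions):

# Hypothetical HdfsTarget subclass: the target only counts as existing once
# the job's _SUCCESS marker is present in the output directory.
from luigi.hdfs import HdfsTarget  # assumed import path

class MapRedTarget(HdfsTarget):

    def exists(self):
        # look for the _SUCCESS flag file rather than just the directory
        success_marker = HdfsTarget(self.path.rstrip('/') + '/_SUCCESS')
        return success_marker.exists()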
One problem with having a with_success_file flag is that the default value would be to have non-atomic file operations. Another issue with _SUCCESS files is that they don't integrate well with external tasks that are built somewhere outside Luigi. Both of these issues could be resolved by flipping it around so that instead we have a _NOT_DONE file that signals that the output is not finished yet. But another problem remains: where do you commit the transaction?

I just started thinking that maybe commit() should be a method on any Target, so that just doing a close() doesn't commit it. All targets returned by the output() methods would then be autocommitted if the task finishes successfully. Doing it this way would also mean that we don't have to deal with any special flag files for LocalTarget and HdfsTarget, because we can still support atomic writes by file renames. The only downside is that if you are doing file operations outside of a task, then you have to remember to explicitly commit() the target once you are done. On the other hand, we could remove the need to call close() on the file handle of a Target object, so for code run within a Task that manually opens/writes files this might even lead to shorter code.

Another advantage is that tasks that write multiple outputs would be almost transactional (except in the rare case where the code fails while looping over the outputs and committing them, but if you want to take it to the extreme I guess you could implement multi-phase locking).

What do you think about this idea? It would be a pretty fundamental change in how the Target object works, but OTOH I think the patch wouldn't be very big.
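A rough sketch of how a commit()-style Target could work (this is just an illustration of the idea, not a real Luigi API; every name here is an assumption):

from luigi.hdfs import HdfsTarget  # assumed import path

# Hypothetical target with an explicit commit(): writes go to a temp target,
# and only commit() renames the temp data onto the real path.
class CommittableHdfsTarget(HdfsTarget):

    def __init__(self, path):
        super(CommittableHdfsTarget, self).__init__(path)
        self._tmp = HdfsTarget(is_tmp=True)

    def open(self, mode='w'):
        # writers always hit the temp target
        return self._tmp.open(mode)

    def commit(self):
        # atomic same-filesystem rename into the final location
        self._tmp.move(self)


# Hypothetical worker-side hook: once a task's run() has returned
# successfully, autocommit every target it returned from output().
def autocommit(task):
    outputs = task.output()
    if not isinstance(outputs, (list, tuple)):
        outputs = [outputs]
    for target in outputs:
        if hasattr(target, 'commit'):
            target.commit()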
What I meant was that the with_success_file flag (or subclass) would only be used to specify outputs of Hadoop jobs where we are sure that the jobs themselves create _SUCCESS files (like standard MapReduce jobs). A problem with using a _NOT_DONE-like structure is that it could be hard to guarantee atomicity when writing this flag file for MapReduce jobs, since a job creates its output directory itself rather than from Python code (possible race condition). There is also the risk that something goes wrong when writing this file, in which case incomplete data could get processed. Here's a suggestion that mixes both solutions:
Is this too complicated?
@freider responded as I was typing this up, but it's probably worth my input, too. Re. success flags, I think Luigi needs to support both success flags and not -- at least we need it to :). The way I was envisioning this was that HdfsTarget would take a flag param (defaulting to None); exists() would check for the flag inside the directory if it's not None, and otherwise just check that the directory exists (this is how Oozie does it). Here are some use cases I have in mind:
Re. BaseHadoopJob and inheritance -- I agree with @freider's suggestions. I can take some time to reorganize things as part of this PR or do it as a follow-up (I'd slightly prefer the latter since the scope of the changeset seems to be growing larger). How do you guys feel about limiting the scope of this changeset a bit? I have some local changes to introduce the atomicity stuff that @erikbern and I discussed, and I can add that in (I'm assuming that if we later added flag support, we'd only apply the rename stuff when the Target didn't have a flag). Then we can do the flag handling and the hierarchy as follow-ups? If you'd rather I roll them all in together, though, just let me know.
Great discussion. Seems like we're discussing three just slightly related things:
On 8 Jan 2013, at 16:36, Joe Crobak notifications@github.com wrote:
I think Freider wanted to do that at some point anyway, so probably nothing required for now. But let's see what he thinks.
* atomic file renames.
* more docs.
* cleaned up example args.
new hadoop_jvm module for java/scala hadoop jobs.
We use Scala MapReduce rather than Hadoop Streaming, so I'm trying to add support for that to Luigi. This is a preliminary draft that I was hoping to get some feedback on.
If you don't feel like this belongs in Luigi proper, then we can keep the code in our own place, but it'd be nice to have some of the refactors I did to make it easier to share code. If you think this makes sense in Luigi, maybe it'd make sense to have the streaming-specific code subclass the general-purpose (poorly named) JvmHadoopJobTask. Thoughts?
commit message:
* new JvmJobTask and extracted some common functionality from HadoopJobTask into BaseHadoopJobTask
* moved jobconf logic into the BaseHadoopJobTask
* bare-bones JvmHadoopJobRunner and extracted common functionality from HadoopJobRunner into static methods (submitting/tracking a job).
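For orientation, the rough shape of the hierarchy this implies (illustrative only -- the class split comes from the commit message and discussion above, while the method names are assumptions, not the actual patch):

import luigi

# Sketch of the proposed layout: shared logic in a common base class,
# with the streaming task and the new jar task as siblings.
class BaseHadoopJobTask(luigi.Task):
    # shared bits: jobconf handling and other common Hadoop configuration
    def jobconfs(self):
        return []

class HadoopJobTask(BaseHadoopJobTask):
    # existing Hadoop Streaming task (Python mapper/reducer)
    pass

class HadoopJarJobTask(BaseHadoopJobTask):
    # new task type from this PR: runs a pre-built Java/Scala jar
    def jar(self):
        pass  # path to the jar (assumed method name)

    def java_class(self):
        pass  # fully qualified main class (assumed method name)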