Truncates long streaming job confs (#1773)
When running a streaming job with lots of inputs, the job conf gets passed to
each mapper. This can cause an exception because too many arguments are passed
to the mapper. The full job conf is not actually needed by the mapper, so it's
safe to truncate it. 20000 is recommended as a safe truncation limit in
http://aajisaka.github.io/hadoop-project/hadoop-streaming/HadoopStreaming.html
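
For illustration, a minimal sketch of the kind of task this helps, assuming luigi with the hadoop and hdfs contrib modules; the task names, paths, mapper, and reducer below are hypothetical, and only the jobconf_truncate default and the stream.jobconf.truncate.limit key come from this change:

import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs


class InputText(luigi.ExternalTask):
    """Hypothetical external input that already exists in HDFS."""

    path = luigi.Parameter()

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget(self.path)


class WordCount(luigi.contrib.hadoop.JobTask):
    """Hypothetical streaming task; a long input list is what bloats the job conf."""

    def requires(self):
        return [InputText(path='/tmp/text/part-%05d' % i) for i in range(500)]

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/tmp/text-count')

    def mapper(self, line):
        for word in line.strip().split():
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)


# After this commit, WordCount().jobconfs() would include
# 'stream.jobconf.truncate.limit=20000' next to 'mapred.reduce.tasks=25',
# so the oversized conf is cut down before being handed to each mapper.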
daveFNbuck authored and Tarrasch committed Jul 29, 2016
1 parent b12be00 commit 0532453
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions luigi/contrib/hadoop.py
@@ -764,6 +764,7 @@ def on_failure(self, exception):
 
 
 class JobTask(BaseHadoopJobTask):
+    jobconf_truncate = 20000
     n_reduce_tasks = 25
     reducer = NotImplemented
 
@@ -773,6 +774,8 @@ def jobconfs(self):
             jcs.append('mapred.reduce.tasks=0')
         else:
             jcs.append('mapred.reduce.tasks=%s' % self.n_reduce_tasks)
+        if self.jobconf_truncate >= 0:
+            jcs.append('stream.jobconf.truncate.limit=%i' % self.jobconf_truncate)
         return jcs
 
     def init_mapper(self):
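
Because the limit is a plain class attribute, a downstream task could adjust or disable it. A minimal sketch, assuming hypothetical subclass names and relying on the ">= 0" guard added above:

import luigi.contrib.hadoop


class WideConfJob(luigi.contrib.hadoop.JobTask):
    # Hypothetical subclass: raise the truncation limit for a job whose
    # mappers need to read more of the job conf.
    jobconf_truncate = 50000


class UntruncatedJob(luigi.contrib.hadoop.JobTask):
    # A negative value fails the ">= 0" check in jobconfs(), so no
    # stream.jobconf.truncate.limit entry is added and Hadoop's default
    # behaviour applies.
    jobconf_truncate = -1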
