Add the extra_arguments to easily pass additional arguments to streaming jobs #1929
@mfcabrera, thanks for your PR! By analyzing the history of the files in this pull request, we identified @Tarrasch, @daveFNbuck and @mvj3 to be potential reviewers.
Great. Good if somebody using hadoop streaming could review as well.
Now that I'm having another look at this part of hadoop.py (after closing #1895), it seems that the objective of both PRs is actually already covered by "arglist += self.streaming_args". Do you agree?
@heuvel yes and no :P. I see the streaming_args parameter in the HadoopJobRunner, but in order to use it you need to specify a custom JobRunner to pass those parameters, which seems like overkill to me. With this change you only need to create a method to make it work. I am pretty sure most of the users don't use
@mfcabrera I would definitely prefer the method approach over the custom JobRunner. However, to make it work for me I would still have to create a JobRunner to add the -archives argument, which is a generic command option of hadoop streaming (as opposed to the streaming command options, of which -cmdenv is the one I need). One idea would be to add both an extra_streaming_arguments and an extra_generic_arguments method (generic options should be placed before the streaming options), but that doesn't make much sense, since all available generic options are already available in the HadoopJobRunner. So I would vote for adding the extra_streaming_arguments method, but to avoid the need to create a custom JobRunner I would like to have a method for adding archives as well.
(updated) @heuvel I understand. I actually prefer
@heuvel I have updated the task and the unit test. I have added the
@Tarrasch I have also tested this change in my projects. Is there anything else I can do to get this merged?
Thanks @mfcabrera!
Description
Added the `extra_streaming_arguments` method to `BaseHadoopJobTask`. This method should return a list of tuples containing additional arguments to the hadoop streaming job. I also added the method `extra_archives` (archives is a generic option), so any subclass overriding this method can make the JobRunner add extra archives.
Motivation and Context
Adding additional/non-supported hadoop streaming parameters currently requires both modifying `JobTask` and implementing a custom `HadoopJobRunner` that passes that parameter to its parent. See http://mfcabrera.com/python/2016/04/25/python-streaming-blog-org.html for an example.
I believe this is a more flexible way of adding parameters, and something like this was proposed by @erikbern in #1895. Actually, this PR supersedes #1895.
Have you tested this? If so, how?
I have added a unit test and I have also tested the change in my own projects.
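For illustration, here is a minimal sketch of how the two hooks from the Description could be overridden and consumed. It does not import luigi: `SketchJobTask`, `build_arglist`, `MyJob`, the jar name and the archive path are all hypothetical stand-ins, and the real JobRunner code in hadoop.py may assemble the command differently. The only points it tries to show are that `extra_streaming_arguments` returns a list of (flag, value) tuples, and that the generic `-archives` option is placed before the streaming options, as discussed above.

```python
class SketchJobTask:
    """Hypothetical stand-in for BaseHadoopJobTask (not luigi's real class)."""

    def extra_streaming_arguments(self):
        # Override to return a list of (flag, value) tuples.
        return []

    def extra_archives(self):
        # Override to return a list of archive paths.
        return []


def build_arglist(job, base_args):
    """Illustrative assembly of a hadoop streaming argument list."""
    arglist = list(base_args)

    # Generic options such as -archives go before streaming options.
    archives = job.extra_archives()
    if archives:
        arglist += ["-archives", ",".join(archives)]

    # Streaming options follow, one flag/value pair per tuple.
    for flag, value in job.extra_streaming_arguments():
        arglist += [flag, value]
    return arglist


class MyJob(SketchJobTask):
    """Example subclass: no custom JobRunner, only two overridden methods."""

    def extra_streaming_arguments(self):
        return [("-cmdenv", "LANG=en_US.UTF-8")]

    def extra_archives(self):
        return ["hdfs:///tmp/venv.zip#venv"]  # hypothetical path


args = build_arglist(MyJob(), ["hadoop", "jar", "streaming.jar"])
print(" ".join(args))
```

A task that overrides neither method gets the base argument list back unchanged, which is the behaviour one would want for existing jobs.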