Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the extra_arguments to easily pass additional arguments to streaming jobs #1929

Merged
merged 3 commits into from
Feb 12, 2017

Conversation

mfcabrera
Copy link
Contributor

@mfcabrera mfcabrera commented Nov 22, 2016

Description

Added the extra_straming_arguments methods to BaseHadoopJobTask. This method should return a list of tuples containing additional arguments to the hadoop streaming job. I also added the method extra_archives (archives is generic option) so it any subclass overriding this method can make the JobRunner to add extra archives.

Motivation and Context

Adding additional/non-supported hadoop streaming parameters requires both modifying JobTask and implementing a custom HadoopJobRunner that pass pass that parameter to its parent. See http://mfcabrera.com/python/2016/04/25/python-streaming-blog-org.html for an example.

I believe this way is a more flxible way of adding parameters and something like this was proposed by @erikbern in #1895 . Actually this PR supersedes #1895 .

Have you tested this? If so, how?

I have added a unit test and I have tested on my own.

@mention-bot
Copy link

@mfcabrera, thanks for your PR! By analyzing the history of the files in this pull request, we identified @Tarrasch, @daveFNbuck and @mvj3 to be potential reviewers.

Copy link
Contributor

@Tarrasch Tarrasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Good if somebody using hadoop streaming could review as well.

@heuvel
Copy link

heuvel commented Nov 23, 2016

Now I'm having a look at this part of hadoop.py again (after closing #1895) it seems that the objective of both PRs is actually already covered by "arglist += self.streaming_args". Do you agree?

@mfcabrera
Copy link
Contributor Author

@heuvel yes and no :P. I see the parameter streaming args in the HadoopJobRunner, but in order to use it you need to specify a custom JobRunner to pass those parameters. Which it seems an overkill to me. with this change you only need to create a method to make it work. I am pretty sure most of the users don't use streaming_args because of that. Maybe the method should be called extra_streaming_arguments. In other words, the objective of this PR is to avoid need of creating a custom JobRunner for passing additional parameters. I would go to the extend of deprecating that parameter.

@heuvel
Copy link

heuvel commented Nov 23, 2016

@mfcabrera I would definitely prefer the method approach instead of the custom JobRunner.

However, for me to make it work I still have to create a JobRunner to add the -archives argument, which is a generic command option of hadoop streaming (instead of the streaming command options of which -cmdenv is the one I need).

It could be an idea to add both an extra_streaming_arguments and an extra_generic_arguments method (generic options should be placed before the streaming options), but that doesn't actually make much sense since all available generic options are already available in the HadoopJobRunner. So I would vote for adding the extra_streaming_arguments method, but to avoid the need of creating a custom jobrunner I would like to have a method for adding archives as well.

@mfcabrera
Copy link
Contributor Author

mfcabrera commented Nov 23, 2016

(updated) @heuvel I understand. I actually prefer extra_streaming_arguments idea. I will think what would be a better solution for this. But generic arguments should be easier to add (i.e. without creating a custom runner).

@mfcabrera
Copy link
Contributor Author

@heuvel I have updated the task and the unittest. I have added the extra_archives method to be overridden. That way you can have a per-class archives parameter (you can make it more general using inheritance). I also added extra_streaming_arguments method, that will add any extra parameters at the end. The idea is to have all the generic options covered and the extra streaming options available without the need of creating a new HadoopRunner. The other option was basically change the call to DefaultHadoopRunner to explicitly pass the parameters but that might break things. @Tarrasch this commit check fail, but I saw the log and it does not look related to the change. Is theres a way to re-run travis checks?

@mfcabrera
Copy link
Contributor Author

Hi @heuvel @Tarrasch I have fixed the test and additionally I tested locally (and nothing breeks!). However, it would be good if someone else tests the -archive thing using extra_archives and the cmdenv using extra_streaming_arguments.

@mfcabrera
Copy link
Contributor Author

@Tarrasch I have tested this change also in my projects. Is there anything else I can do to get this merged?.

@Tarrasch Tarrasch merged commit 6ecedb7 into spotify:master Feb 12, 2017
@Tarrasch
Copy link
Contributor

Thanks @mfcabrera!

kreczko pushed a commit to kreczko/luigi that referenced this pull request Mar 28, 2017
@mfcabrera mfcabrera deleted the jobtask-extra-parameters branch December 27, 2019 07:58
This was referenced Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants