-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Visible parameter for luigi.Parameters #2278
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to challenge me and explain your train of thought. Thanks!
luigi/worker.py
Outdated
@@ -775,7 +775,7 @@ def _add(self, task, is_complete): | |||
runnable=runnable, | |||
priority=task.priority, | |||
resources=task.process_resources(), | |||
params=task.to_str_params(), | |||
params=task.to_str_params(only_visible=True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand here why you set the default to false
, but force the worker to pass true
. I feel that whatever the result may be, it shouldn't affect people who don't have this value set on their Parameters.
Thus if visible is set to true, then only_visible
should be true
by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dlstadther Thank you for your response!
I feel that whatever the result may be, it shouldn't affect people who don't have this value set on their Parameters.
It will not affect such people because in luigi.Parameter's constructor visible value by default is True
I'm not sure I understand here why you set the default to false, but force the worker to pass true
I set the default value to false, because method to_str_params
is called from other places and in such cases it expects to recieve all parameters from task regardless of values like significant or visible.
For example,
def _announce_scheduling_failure(self, task, expl):
self._scheduler.announce_scheduling_failure(
worker=self._id,
task_name=str(task),
family=task.task_family,
params=task.to_str_params(only_significant=True), # regadless of visible, will return only significant params
expl=expl,
owners=task._owner_list(),
)
self._scheduler.add_task(
worker=self._id,
task_id=task.task_id,
module=get_work_response.get('task_module'),
family=get_work_response['task_family'],
params=task.to_str_params(), # as before return all params
status=RUNNING,
batch_id=get_work_response['batch_id'],
)
Thus if visible is set to true, then only_visible should be true by default.
In general, i tried to copy significant
logic but for another purpose and visible
is similar to it
def __init__(self, default=_no_value, is_global=False, significant=True, description=None,
config_path=None, positional=True, always_in_help=False, batch_method=None):
def __init__(self, default=_no_value, is_global=False, significant=True, description=None,
config_path=None, positional=True, always_in_help=False, batch_method=None, visible=True):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then, if i understand this correctly... this _add()
method is what the worker uses to push the definition of each task to the web view? So by removing non-visible params from the task's tostr, you remove them from visibility?
Are there any negative side-effects if i have two nearly identical tasks which only differ in username/password? They would have the same result from to_str_params()
when they are not the same task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then, if i understand this correctly... this _add() method is what the worker uses to push the definition of each task to the web view? So by removing non-visible params from the task's tostr, you remove them from visibility?
Exactly.
In my case, the main purpose was to hide password, username, host and port only from web view in tasks like that:
class SomeAction(luigi.Task):
password = luigi.Parameter(visible=False)
username= luigi.Parameter(visible=False)
host= luigi.Parameter(visible=False)
port= luigi.Parameter(visible=False)
id = luigi.IntParameter()
date = luigi.DateParameter()
def output(self):
return luigi.PostgresTarget(...)
def requires(self):
return [OtherTasks(self.password, self.username, self.port, self.host)]
...
In the result, in web page visible only ID and Date. I tested it on our dev place with different parameters and it works fine for me.
Also, for additional testing i pass to luigi.interface.run my own worker_scheduler_factory to produce custom scheduler\workers and i place some logs into it to control task params
Are there any negative side-effects if i have two nearly identical tasks which only differ in username/password? They would have the same result from to_str_params() when they are not the same task.
Probably, i don't correctly understand what do you mean by They would have the same result from to_str_params() when they are not the same task
. For such tasks, to_str_params would have some affect only for params where visible=False and only for web page, not for the task instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. I think i understand.
luigi/task.py
Outdated
@@ -469,14 +469,14 @@ def from_str_params(cls, params_str): | |||
|
|||
return cls(**kwargs) | |||
|
|||
def to_str_params(self, only_significant=False): | |||
def to_str_params(self, only_significant=False, only_visible=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[see worker comment]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this lgtm.
Could you @ mention one or two others for review?
Thank you for your approval @dlstadther ! Sorry, If I mentioned someone incorrectly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good start on solving this everlasting issue. I like the idea of introducing a new option instead of changing significant
However this PR don't have any tests nor docs yet. :)
@Tarrasch , If this is the only problem at this moment i will add needed tests and docs as soon as possible =) |
@Tarrasch , i've added docs and tests for visible param. |
Looks good so far. Hmm... I didn't imagine it would be "this simple" solving this so long overdue issue. Maybe let this PR be open for a week or so and let people chime in? Great job so far. :) |
Thank you!
Ok =) |
I'm thinking out aloud here: I think we should consider adding multiple levels of privacy.
It's easy to generalise one category into another. Ex. At first glance one would think invisible params are always insignificant. But I have one use case in our codebase which isn't so. Similarly any attempt to generalise the categories might result in long arguments. But one generalisation that does make sense(To me; Correct me if I'm wrong) is private params being invisible since they are not even available to the scheduler to display or hide. Consequently, it might be a good Idea to add a third state to visible. |
But do we need to store invisible params into history and\or allow to access them from remote scheduler?
At this moment, params with
I can imagine these states or similar to them: |
Yes. We do have some such params in our tasks where machine cares about the params but humans don't.
Exactly what I had in mind. However I would rename |
Well, i've added some fixes for I didn't resolve the problem with tuple like (param_value, visibility_type).
For now, it works like described above:
And also, hidden params are recorded into task_history_params. Later, i'll add unit tests for it Probably, Maybe add additional param inside Task class (in scheduler.py) like public_params to avoid extra overhead on filtering hidden params in this function? |
@TMaYaD , @Gr1f0n6x, Thank you two so much for thinking this through! I really like that you have this discussion and I like the direction it's going! |
Finally, i came up with (i believe not bad ^_^) solution. I tried to not modify old code to not break the compatibility with existing code. I've tested this in two ways:
The one thing i still can't resolve it is the constants for visible modifiers. Many tests failed =/ |
Yes. I think using an enum is a sound abstraction for this. :) |
Hi, @Gr1f0n6x I would be interested in having something like such work due to how we're using Luigi it would be beneficial for us to hide some parameters from the admin interface for convenience. Did you happen to make any progress on this outside of this repo? Is this still something that is on your mind? What are the things that would left in order to consider this feature complete? Thanks! |
Hi @cabouffard ,
The main problem now is to add 3rd state (hidden, in this case) for parameters visibility, because some of my solutions change core logic and it break some tests. If you want just fully hide some parameters and show others - you can use first commit for it (we still use this solution in prod). Also, i have a vacation now, so, i can continue to work on it and i believe, that i'll finish soon this feature =) |
@Gr1f0n6x Thanks for the feedback! We're eager to use that feature on our pipeline! :) |
Still want to say I'm still looking forward to this being implemented. :) |
@Tarrasch i've fixed one of the failed tests. The problem was in |
@Gr1f0n6x Docs are failing for you due to your Basically I mention it here when it happened for another PR. |
@dlstadther , @Tarrasch would it be ok if i replace |
Just looked through this PR history and found @Tarrasch mentioned we're using a backport for enum. I wonder then why this error is occurring... |
Looks like we reference it in tox.ini |
Maybe we need to add it to |
This isn't the beautiful way to do it, but maybe it's still the best so users still can just "download and run" luigi still on python 2.7. You can add a TODO in case we ever stop supporting 2.7 without various backported packages. |
I agree with that, especially about python 2.7. supporting (we use luigi in prod using python 2.7 ^_^), but, at this moment, i don't see any options. Maybe, if we add |
My recommendation fixes tests. But Python 2.7 users would still have to install enum34 from Pypi to not have |
I think that there is still a lot luigi users who use py 2.7 and it will surprise them that after update their should install new package. |
As a Python 2.7 user, I realize that 2020 is coming quicker than I'd like. So i'm personally ok with adding enum34. Do we know if 3.4+ can install |
@Tarrasch , @dlstadther i've added enum34 as deps in tox.ini [testenv:docs] and also add this package as dependency in setup.py for python < 3.4.
And found that it can cause problems in some cases. So, to avoid this, i think, that we could decide to install or not enum34 dynamically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Gr1f0n6x Thanks for adding in the dynamic enum34
install and fixing the tests! On that front, i'm satisfied.
After another quick review of the entire PR, all i have left is an optional comment about the unnecessary use of a named positional arg. You can chose to ignore it if you'd like.
Assuming @Tarrasch approves, I say merge it!
Thanks for your hard work!
luigi/scheduler.py
Outdated
self.params = {} | ||
self.public_params = {} | ||
self.hidden_params = {} | ||
self.set_params(params=params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nbd, but it's slight overkill to include the named arg here
set_params(params=params)
== set_params(params)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✔︎
luigi/scheduler.py
Outdated
if not task.params: | ||
task.params = _get_default(params, {}) | ||
task.set_params(params=params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✔︎
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this ready for final review?
@@ -48,6 +49,9 @@ def get_static_files(path): | |||
install_requires.remove('python-daemon<3.0') | |||
install_requires.append('sphinx>=1.4.4') # Value mirrored in doc/conf.py | |||
|
|||
if sys.version_info < (3, 4): | |||
install_requires.append('enum34>1.1.0') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not strongly against this. But it will break for some users who don't install the package the normal way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dlstadther, what do you think? Maybe it's time to request users to always install through a package manager, hopefully not many will oppose this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does one use luigi without either PyPi or running setup.py
?
So long as people can appropriately use a released version from PyPi or bleeding edge through manual install, I'm 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you could just download the source filles and set your PYTHONPATH=luigi
and do python luigi/.xxx.py
and get going. For the server you still needed tornado.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think people will complain having to install with a package manager or manually setup.py install/develop
@Tarrasch , yes, i think that it is ready for review |
task = Task() | ||
luigi.build(tasks=[task], workers=2, scheduler_port=self.get_http_port()) | ||
|
||
time.sleep(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sleeping in tests is not ideal but okay. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, it's mistake :(
I wanted to add delay only for remote scheduler and only to guarantee that all messages were delivered, because i thought that there is a chance that message will not be delivered
|
||
task = TestTask3() | ||
|
||
self.assertEqual(task.to_str_params(), {'param_one': '1', 'param_two': '2', 'param_three': '3'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What practice do you follow to put out spaces in tests by the way? Some like the Arrange Act Assert style. We have no selected style in luigi so this is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, using additional space i want to emphasize the assertion block.
Just saw this style somewhere ^_^
Aaaaaaawesome!!! We've wanted this for many years already! So nice to see it finally completed! |
See the docs for usage.
* upstream-master: (82 commits) S3 client refactor (spotify#2482) Rename to rpc_log_retries, and make it apply to all the logging involved Factor log_exceptions into a configuration parameter Fix attribute forwarding for tasks with dynamic dependencies (spotify#2478) Add a visiblity level for luigi.Parameters (spotify#2278) Add support for multiple requires and inherits arguments (spotify#2475) Add metadata columns to the RDBMS contrib (spotify#2440) Fix race condition in luigi.lock.acquire_for (spotify#2357) (spotify#2477) tests: Use RunOnceTask where possible (spotify#2476) Optional TOML configs support (spotify#2457) Added default port behaviour for Redshift (spotify#2474) Add codeowners file with default and specific example (spotify#2465) Add Data Revenue to the `blogged` list (spotify#2472) Fix Scheduler.add_task to overwrite accepts_messages attribute. (spotify#2469) Use task_id comparison in Task.__eq__. (spotify#2462) Add stale config Move github templates to .github dir Fix transfer config import (spotify#2458) Additions to provide support for the Load Sharing Facility (LSF) job scheduler (spotify#2373) Version 2.7.6 ...
* upstream-master: S3 client refactor (spotify#2482) Rename to rpc_log_retries, and make it apply to all the logging involved Factor log_exceptions into a configuration parameter Fix attribute forwarding for tasks with dynamic dependencies (spotify#2478) Add a visiblity level for luigi.Parameters (spotify#2278) Add support for multiple requires and inherits arguments (spotify#2475) Add metadata columns to the RDBMS contrib (spotify#2440) Fix race condition in luigi.lock.acquire_for (spotify#2357) (spotify#2477) tests: Use RunOnceTask where possible (spotify#2476) Optional TOML configs support (spotify#2457) Added default port behaviour for Redshift (spotify#2474) Add codeowners file with default and specific example (spotify#2465) Add Data Revenue to the `blogged` list (spotify#2472)
* upstream-master: Remove long-deprecated scheduler config variable alternatives (spotify#2491) Bump tornado milestone version (spotify#2490) Update moto to 1.x milestone version (spotify#2471) Use passed password when create a redis connection (spotify#2489) S3 client refactor (spotify#2482) Rename to rpc_log_retries, and make it apply to all the logging involved Factor log_exceptions into a configuration parameter Fix attribute forwarding for tasks with dynamic dependencies (spotify#2478) Add a visiblity level for luigi.Parameters (spotify#2278) Add support for multiple requires and inherits arguments (spotify#2475) Add metadata columns to the RDBMS contrib (spotify#2440) Fix race condition in luigi.lock.acquire_for (spotify#2357) (spotify#2477) tests: Use RunOnceTask where possible (spotify#2476) Optional TOML configs support (spotify#2457) Added default port behaviour for Redshift (spotify#2474) Add codeowners file with default and specific example (spotify#2465) Add Data Revenue to the `blogged` list (spotify#2472) Fix Scheduler.add_task to overwrite accepts_messages attribute. (spotify#2469) Use task_id comparison in Task.__eq__. (spotify#2462)
I can still see parameter with visibility set to PRIVATE in luigi execution summary Sample Flow import luigi class PTest(luigi.Task):
class PtestExternal(luigi.ExternalTask):
|
@cabhi ,the goal was to hide parameters only in WEB part and in database. |
Description
Visible parameter for luigi.Parameter class which is needed to show/hide some parameters from WEB page such as passwords to database
Motivation and Context
Well, in my case this change is crucial because of some of my tasks connect to database and contain passwords for it which i don't want to show it from web page to all.
In code it will look like this:
Have you tested this? If so, how?
I tested it on my tasks and it works fine. It just removes this parameters from node represents current task, so it shouldn't break any compatibility
(Below images are just examples)
Before:
After:
Maybe, it will be helpful for others too in similar case