NF: Result hooks #3903
Conversation
…e is final This was discovered in datalad#3903
Codecov Report
@@ Coverage Diff @@
## master #3903 +/- ##
=========================================
+ Coverage 71.01% 80.7% +9.69%
=========================================
Files 272 274 +2
Lines 36107 36220 +113
=========================================
+ Hits 25640 29233 +3593
+ Misses 10467 6987 -3480
Continue to review full report at Codecov.
Generally looks very nice indeed. I'm struggling a bit ATM to get my head around the security implications and what part of it might need addressing where. In principle we have a similar issue with procedures anyway, since you can "catch them" by updating a dataset. And we can't completely avoid the issue without losing substantial functionality.
Thanks for the review! We already allow this kind of execution based on config with proc-pre/post for individual commands. So in that respect this PR doesn't add a new threat.
Yes, agree. Just raises the priority we need to think about this with.
Maybe we should afford a new ConfigManager instance that does not read configuration committed in a dataset. However, I would do that in a separate PR, as it affects any and all types of hooks.
i.e. to not read any configuration that is committed in a dataset. This is a needed mode of operation in any situation where it is not safe to have configuration pulled from elsewhere affect local operation, e.g. the definition and execution of hooks (see datalad#3903)
I proposed #3907 to address the security concerns.
If you'd like to wait, I'll find time to review this today, but it's of course fine if you want to proceed with the merge. Without having looked into this in any detail, I'll just say that I'm happy that @bpoldrack brought up the security concerns, and that that's already led to a PR. I should have thought of those concerns when
I'll be more than happy to wait. Thanks!
Working on resolving the conflict...
This aims to be a more flexible alternative to --proc-pre/post and its proposed successor of command hooks (datalad#3264). The key idea is that we have DataLad's results that all pass through the main event loop, and we can define ad-hoc hooks that run custom actions whenever a matching result is observed. Key differences to what we already have:
- can act more than once per command execution
- no "pre" action anymore
- informed by the actual result itself
- runs dataset procedures, but also any proper datalad commands
Otherwise Windows paths and generally Path object instances will crash the machine.
[ci skip]
I think this is a clever approach. In addition to being flexible, the implementation is pleasing in terms of being a mostly independent layer on top of the current result handling.
On the other hand, it doesn't seem particularly pleasant to work with from a user perspective. Beyond the somewhat tedious task of formatting the json values for match and proc, this would require a decent investment from the user to think about how they should match results. It also makes me wonder how consistent or useful of a surface our result records provide for latching onto. And makes me worried that hooks will start matching very specific things about the results, and we're going to fear causing breakage even when changing minor details about the results.
At any rate, given that this is non-intrusive and seems like it could be quite powerful, I'm for going forward with it and seeing how it plays.
datalad/core/local/resulthooks.py
Outdated
'Incomplete result hook configuration %s in %s' % (
    h[:-6], cfg))
continue
sep = proc.index(' ')
Would there be a downside here of being more lenient and splitting on the first whitespace (proc.split(maxsplit=1))? Solely from a readability point, I'd find it nice to get rid of some of the indexing below.
Given that the proc and match values are coming from the user, it would be good to show a bit more helpful message if they aren't valid. For example, the .index() call above would fail with
[ERROR ] substring not found [resulthooks.py:get_hooks_from_config:45] (ValueError)
if the value didn't have a space. The json.loads() is another case (and perhaps more likely when users are trying to put valid json into a git config value). And misformatted arguments would make it through to run_hook before causing issues, so that's yet another spot.
Given that these hooks are secondary to the main command, perhaps these failures should give a warning and continue on with the rest of result processing.
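Along the lines of that suggestion, a lenient warn-and-skip parser could look roughly like the following. This is a hypothetical helper illustrating the suggestion, not the PR's actual code; the function name and warning wording are invented.

```python
import json
import logging

lgr = logging.getLogger(__name__)

def parse_hook_proc(name, proc):
    """Split a 'proc' value into (command, kwargs).

    Hypothetical sketch: instead of raising on malformed input, log a
    warning and return None so result processing can continue.
    """
    # split on the first run of whitespace; also tolerates
    # argument-less call specifications
    parts = proc.split(maxsplit=1)
    if not parts:
        lgr.warning("Empty proc specification for hook %s, skipping", name)
        return None
    cmd = parts[0]
    if len(parts) == 1:
        return cmd, {}
    try:
        kwargs = json.loads(parts[1])
    except ValueError as e:
        lgr.warning(
            "Invalid argument specification for hook %s [%s], skipping",
            name, e)
        return None
    return cmd, kwargs
```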
> Would there be a downside here of being more lenient and splitting on the first whitespace (proc.split(maxsplit=1))? Solely from a readability point, I'd find it nice to get rid of some of the indexing below.

Oh yes! Thanks for reminding me that we are in PY3 land now!! Done.
I also added the ability to have argument-less call specifications, and a smoke test for them.

> Given that the proc and match values are coming from the user, it would be good to show a bit more helpful message if they aren't valid. For example, the .index() call above would fail with [ERROR ] substring not found [resulthooks.py:get_hooks_from_config:45] (ValueError) if the value didn't have a space. The json.loads() is another case (and perhaps more likely when users are trying to put valid json into a git config value). And misformatted arguments would make it through to run_hook before causing issues, so that's yet another spot. Given that these hooks are secondary to the main command, perhaps these failures should give a warning and continue on with the rest of result processing.
A broken match spec would now yield such warning:
WARNING: Invalid match specification in datalad.result-hook.annoy.match-json: {"type":["in", ["file"]],"action":"get",status":"notneeded"} [Expecting property name enclosed in double quotes: line 1 column 41 (char 40) [decoder.py:raw_decode:353]], hook will be skipped
Likewise, a broken call spec given something like this:
WARNING: Invalid argument specification for hook annoy (after parameter substitutions): {"cmd":"touch" /tmp/datalad_temp_test_basicsnscytin3/file1_annoyed","dataset":"/tmp/datalad_temp_test_basicsnscytin3","explicit":true} [Expecting ',' delimiter: line 1 column 16 (char 15) [decoder.py:raw_decode:353]], hook will be skipped
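For reference, the bracketed decoder part of the first warning is plain json.loads output; a quick sketch reproduces it:

```python
import json

# the malformed match value from the warning above: the opening quote
# before `status` is missing
bad = '{"type":["in", ["file"]],"action":"get",status":"notneeded"}'
try:
    json.loads(bad)
    err = None
except json.JSONDecodeError as e:
    err = str(e)
# err: 'Expecting property name enclosed in double quotes:
#       line 1 column 41 (char 40)'
```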
Thx!
I acknowledge all these concerns. But I also see the increased focus on result composition and placement as a chance to treat them more rigorously and thoughtfully (e.g. #3906). Re usability: I simply could not yet come up with something more convenient. But what about tweaking things right away to enable better forward compatibility:
This gives us the freedom to implement support for different/simpler approaches in the future. I guess it is reasonable to assume that any approach would need to specify a criterion and a call specification.
Cool, thx! Perhaps a word on the concrete underlying motivation to implement this now, and like this. The use case is the installation of subdatasets and file content getting in a YODA dataset scenario. Say we install a YODA dataset as a throwaway clone in an HPC environment. We do not care about data safety much, but we do care about performance. Hence we want to turn on as much performance tuning as possible (reckless-install,...). Additionally, we cannot have symlinks for files, because the particular analysis code cannot handle them. At the same time, there are real-world constraints that we need to deal with (unlock blows storage demands up, annex.thin needs v7, v7 adjusted branches have conceptual issues for YODA #3818). With this feature implemented, I can configure the root dataset clone to arbitrarily tune any newly installed subdataset however I see fit, independent of the particular analysis. I can also make
OK, I'll be bold and merge this in a few min.
Thanks for the comprehensive comments in this PR! I am adding a handbook section on this new feature here.
This aims to be a more flexible alternative to --proc-pre/post and its proposed successor of command hooks (#3264).
The key idea is that we have DataLad's results that all pass through the main event loop, and we can define ad-hoc hooks that run custom actions whenever a matching result is observed.
Key differences to what we already have:
- can act more than once per command execution
- no "pre" action anymore
- informed by the actual result itself
- runs dataset procedures, but also any proper datalad commands
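To make the idea concrete, here is a minimal sketch of the mechanism: results flow through a loop, each is tested against the match specs, and matching results trigger the configured action. Names and semantics are invented/simplified here, this is not the actual implementation.

```python
# Minimal sketch of the result-hook idea; names are invented and the
# semantics simplified relative to the actual implementation.

def matches(result, spec):
    """Test a result dict against a match spec supporting eq/neq/in/nin."""
    for key, crit in spec.items():
        if key not in result:
            return False
        if isinstance(crit, list) and len(crit) == 2 \
                and crit[0] in ('eq', 'neq', 'in', 'nin'):
            op, val = crit
            if (op == 'eq' and result[key] != val) \
                    or (op == 'neq' and result[key] == val) \
                    or (op == 'in' and result[key] not in val) \
                    or (op == 'nin' and result[key] in val):
                return False
        elif result[key] != crit:  # a bare value implies an equality test
            return False
    return True

def process_results(results, hooks):
    """Pass results through unchanged, firing any matching hooks on the way."""
    for res in results:
        for match_spec, action in hooks:
            if matches(res, match_spec):
                action(res)
        yield res
```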
To define a hook, two config variables need to be set:
datalad.result-hook.<name>.match
datalad.result-hook.<name>.proc
where <name> is any Git config compatible identifier.

match contains a JSON-encoded dict that is used to match a result against in order to test whether the respective hook should run. It can contain any number of keys. For each key it is tested whether the value matches the one in the result; if all match, the hook is executed. In addition to == tests, in, not in, and != tests are supported. The operation can be given by wrapping the test value into a list: the first item is the operation label ('eq', 'neq', 'in', 'nin'), the second is the test value (set).

proc is the specification of what the hook execution comprises. Any datalad command is suitable (which includes run_procedure). The value is a string, where the first word is the name of the datalad command to run (in Python notation). The remainder of the string is a JSON-encoded dict with keyword arguments for the command execution. Unlike match, string substitution is supported: any key from a matching result can be used to trigger a substitution with the respective value in the result dict. In addition, a dsarg key is supported that is expanded with the dataset argument that was given to the command that the eval_func decorator belongs to and is processing the results. Because the string substitution uses Python's format(), curly braces have to be protected.
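As a hypothetical illustration of such settings (hook name, command, and values are all invented here), the two config values and the substitution step might look like this. Note the doubled braces in the proc value, which protect the literal JSON braces from format() while {path} and {dsarg} remain substitution fields:

```python
import json

# Illustrative sketch only -- hook name, command, and values are invented.
# Values as they might be stored in the git config:
match_value = '{"type": ["in", ["file"]], "action": "get", "status": "notneeded"}'
proc_value = ('run {{"cmd": "touch {path}_annoyed", '
              '"dataset": "{dsarg}", "explicit": true}}')

# the match value is plain JSON
spec = json.loads(match_value)

# sketch of the proc handling: split off the command name, then format()
# the remainder with values from a (hypothetical) matching result
cmd, args = proc_value.split(maxsplit=1)
result = {"path": "/tmp/ds/file1", "dsarg": "/tmp/ds"}
kwargs = json.loads(args.format(**result))
```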
Hook evaluation obviously slows processing, especially given its location in the code path (eval_func). The code tries to minimize this impact, but the lookup of potential hooks in the config represents an unconditional additional cost. That said, I consider this an extremely powerful mechanism that can be used to achieve custom setups without having to add features to the implementation of particular commands, so in summary I think it is worth the cost.
For more info, please see the test inside.
Benchmarks:
Our standard benchmarks show no impact (not a surprise, not much happening in them). So I ran tests that generate a lot of results (saving a dataset with 10k tiny files):
Looking forward to your feedback @datalad/developers
TODO: