
NF: Result hooks #3903

Merged: 15 commits, Dec 10, 2019

Conversation

@mih (Member) commented Dec 6, 2019

This aims to be a more flexible alternative to --proc-pre/post and its proposed successor, command hooks (#3264).

The key idea is that we have DataLad's results that all pass through the main event loop, and we can define ad-hoc hooks that run custom actions whenever a matching result is observed.

Key differences to what we already have:

  • can act more than once per command execution
  • no "pre" action anymore
  • informed by the actual result itself
  • runs dataset procedures, but also any proper datalad command

To define a hook, two config variables need to be set:

  • datalad.result-hook.<name>.match
  • datalad.result-hook.<name>.proc

where <name> is any Git config compatible identifier.

match contains a JSON-encoded dict that a result is matched against in order to decide whether the respective hook should run. It can contain any number of keys. Each key's value is tested against the corresponding value in the result; the hook is executed only if all tests pass. In addition to == tests, in, not in, and != tests are supported. The operation is selected by wrapping the test value into a list: the first item is the operation label ('eq', 'neq', 'in', 'nin'), the second is the test value (or set of values). Example:

{"type": ["in", ["file", "directory"]], "action": "get", "status": "notneeded"}

proc specifies what the hook execution comprises. Any datalad command is suitable (which includes run_procedure). The value is a string whose first word is the name of the datalad command to run (in Python notation). The remainder of the string is a JSON-encoded dict with keyword arguments for the command execution. Unlike match, string substitution is supported: any key from a matching result can be used as a placeholder that is replaced with the respective value from the result dict. In addition, a dsarg key is supported that expands to the dataset argument given to the command whose results are being processed by the eval_func decorator.
Because the string substitution uses Python's format(), literal curly braces have to be protected by doubling them. Hence an example setting could look like:

unlock {{"dataset": "{dsarg}", "path": "{path}"}}

or

run {{"cmd": "touch {path}_annoyed", "dataset": "{dsarg}", "explicit": true}}

Hook evaluation obviously slows processing, especially given the location in the code path (eval_func). The code is trying to minimize this impact. However, the lookup of potential hooks in the config represents an unconditional additional cost.

That said, I consider this an extremely powerful mechanism that can be used to achieve custom setups without having to add features to the implementations of particular commands. So in summary I think it is worth the cost.

For more info, please see the test inside.

Benchmarks:

Our standard benchmarks show no impact (not a surprise, not much happening in them). So I ran tests that generate a lot of results (saving a dataset with 10k tiny files):

# setup: create 10k tiny files spread over a nested directory hierarchy
# (each path is derived from a UUID with its dashes turned into slashes)
for i in $(seq 10000); do s=$(uuid | tr '-' '/'); mkdir -p ${s:5:19} && echo $s > ${s:5}; done;

# no hooks defined, this PR
datalad save  27.97s user 11.27s system 101% cpu 38.800 total

# no hooks defined, master 6031944e8a7770ec59389b6d7ec02d122cfbcbc3
datalad save  28.72s user 11.56s system 84% cpu 47.818 total

Looking forward to your feedback @datalad/developers

TODO:

  • string substitution with Windows paths leads to invalid JSON. No idea how to deal with that yet
  • investigate and potentially adjust the timing of installation success results for datasets. It seems as if further processing is being performed on the dataset (content) after the result is yielded (see disabled test inside). Now addressed in BF: Make sure to not issue an install-ok result, before the repo state is final #3906 (merged into this PR too)

mih added a commit to mih/datalad that referenced this pull request Dec 7, 2019
mih added a commit to mih/datalad that referenced this pull request Dec 8, 2019
codecov bot commented Dec 8, 2019

Codecov Report

Merging #3903 into master will increase coverage by 9.69%.
The diff coverage is 78.76%.


@@            Coverage Diff            @@
##           master   #3903      +/-   ##
=========================================
+ Coverage   71.01%   80.7%   +9.69%     
=========================================
  Files         272     274       +2     
  Lines       36107   36220     +113     
=========================================
+ Hits        25640   29233    +3593     
+ Misses      10467    6987    -3480
Impacted Files Coverage Δ
datalad/core/local/tests/test_resulthooks.py 100% <100%> (ø)
datalad/config.py 97.42% <50%> (+21.87%) ⬆️
datalad/core/local/resulthooks.py 63.93% <63.93%> (ø)
datalad/interface/utils.py 88.63% <93.33%> (+18.86%) ⬆️
datalad/downloaders/tests/test_http.py 58.39% <0%> (-2.19%) ⬇️
datalad/support/tests/test_annexrepo.py 96% <0%> (-0.4%) ⬇️
datalad/tests/test_utils.py 95.84% <0%> (-0.27%) ⬇️
datalad/interface/tests/test_unlock.py 98.07% <0%> (ø) ⬆️
... and 78 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b45ca69...67a43bf.

@bpoldrack (Member) commented:
Generally looks very nice indeed.

I'm struggling a bit ATM to get my head around the security implications and what part of them might need addressing where. In principle we have a similar issue with procedures anyway, since you can "catch them" by updating a dataset. And we can't completely avoid the issue without losing substantial functionality.
However, if we allow code execution via a committed config, we kinda make it worse. So this is not an objection to this PR and approach (I do think we need it), but I think it raises the we-need-to-address-this level. At least we should give users a defense, like ignoring .datalad/config wrt procedures and hooks.

@mih (Member Author) commented Dec 8, 2019

Thanks for the review!

We already allow this kind of config-based execution with proc-pre/post for individual commands. So in that respect this PR doesn't add a new threat.

@bpoldrack (Member) commented:
Yes, agree. Just raises the priority we need to think about this with.

@mih (Member Author) commented Dec 8, 2019

Maybe we should use a new ConfigManager instance within eval_func that does not read .datalad/config.

However, I would do that in a separate PR, as it affects any and all types of hooks.

mih added a commit to mih/datalad that referenced this pull request Dec 9, 2019
i.e. to not read any configuration that is committed in a dataset.
This is a needed mode of operation in any situation where it is
not safe to have configuration pulled from elsewhere affect local
operation, e.g. the definition and execution of hooks (see
datalad#3903)
@mih (Member Author) commented Dec 9, 2019

I proposed #3907 to address the security concerns.

@bpoldrack self-requested a review December 9, 2019 12:27
@mih (Member Author) commented Dec 9, 2019

Cool, thanks. When #3907 is merged, I'll fix the resulting conflicts with this one, and proceed to merge, once the tests have passed again. Unless @kyleam comes to a different conclusion ;-)

@kyleam (Contributor) commented Dec 9, 2019

> Unless @kyleam comes to a different conclusion ;-)

If you'd like to wait, I'll find time to review this today, but it's of course fine if you want to proceed with the merge.

Without having looked into this in any detail, I'll just say that I'm happy @bpoldrack brought up the security concerns, and that that's already led to a PR. I should have thought of those concerns when the datalad.<name>.proc-{pre,post} stuff was introduced. Even worse, I was thinking about git hooks/config and security somewhat recently when reading this git thread, and it did not occur to me that---given that untracked .git/{config,hooks} raise security concerns---tracked values of .proc-{pre,post} (and probably other stuff related to procedures) should be re-evaluated carefully.

@mih (Member Author) commented Dec 9, 2019

I'll be more than happy to wait. Thanks!

@mih (Member Author) commented Dec 9, 2019

Working on resolving the conflict...

mih added 10 commits December 9, 2019 17:33
Otherwise Windows paths, and Path object instances in general, will crash the machine.
@kyleam (Contributor) left a comment
I think this is a clever approach. In addition to being flexible, the implementation is pleasing in terms of being a mostly independent layer on top of the current result handling.

On the other hand, it doesn't seem particularly pleasant to work with from a user perspective. Beyond the somewhat tedious task of formatting the JSON values for match and proc, this would require a decent investment from the user to think about how they should match results. It also makes me wonder how consistent or useful a surface our result records provide for latching onto. And it makes me worry that hooks will start matching very specific things about the results, and we're going to fear causing breakage even when changing minor details of the results.

At any rate, given that this is non-intrusive and seems like it could be quite powerful, I'm for going forward with it and seeing how it plays.

datalad/core/local/resulthooks.py (inline review comments, resolved)
'Incomplete result hook configuration %s in %s' % (
h[:-6], cfg))
continue
sep = proc.index(' ')
@kyleam (Contributor) commented:

Would there be a downside to being more lenient and splitting on the first whitespace (proc.split(maxsplit=1))? Solely from a readability standpoint, I'd find it nice to get rid of some of the indexing below.


Given that the proc and match values come from the user, it would be good to show a more helpful message when they aren't valid. For example, the .index() call above would fail with

[ERROR  ] substring not found [resulthooks.py:get_hooks_from_config:45] (ValueError) 

if the value didn't contain a space. The json.loads() is another case (and perhaps the more likely one when users are trying to put valid JSON into a git config value). And misformatted arguments would make it through to run_hook before causing issues, so that's yet another spot.

Given that these hooks are secondary to the main command, perhaps these failures should give a warning and continue on with the rest of result processing.
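The lenient parsing suggested here could look something like this (an illustrative sketch, not the actual patch; get_hook_call is a made-up name):

```python
import json
import logging

lgr = logging.getLogger('datalad.example')

# Sketch of the lenient parsing suggested above: split on the first
# whitespace instead of using .index(' '), and skip the hook with a
# warning instead of crashing on malformed JSON
def get_hook_call(name, proc):
    parts = proc.split(maxsplit=1)
    cmd = parts[0]
    try:
        args = json.loads(parts[1]) if len(parts) > 1 else {}
    except json.JSONDecodeError as e:
        lgr.warning(
            'Invalid argument specification for hook %s [%s], '
            'hook will be skipped', name, e)
        return None
    return cmd, args

print(get_hook_call('annoy', 'unlock {"path": "f1"}'))  # ('unlock', {'path': 'f1'})
print(get_hook_call('annoy', 'status'))                 # ('status', {})
print(get_hook_call('annoy', 'run {broken'))            # None (after a warning)
```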

@mih (Member Author) replied:

> Would there be a downside to being more lenient and splitting on the first whitespace (proc.split(maxsplit=1))? Solely from a readability standpoint, I'd find it nice to get rid of some of the indexing below.

Oh yes! Thanks for reminding me that we are in PY3 land now!! Done.

I also added the ability to have argument-less call specifications, and a smoke test for them.

> Given that the proc and match values come from the user, it would be good to show a more helpful message when they aren't valid. For example, the .index() call above would fail with
>
> [ERROR  ] substring not found [resulthooks.py:get_hooks_from_config:45] (ValueError)
>
> if the value didn't contain a space. The json.loads() is another case (and perhaps the more likely one when users are trying to put valid JSON into a git config value). And misformatted arguments would make it through to run_hook before causing issues, so that's yet another spot.
>
> Given that these hooks are secondary to the main command, perhaps these failures should give a warning and continue on with the rest of result processing.

A broken match spec now yields a warning like this:

WARNING: Invalid match specification in datalad.result-hook.annoy.match-json: {"type":["in", ["file"]],"action":"get",status":"notneeded"} [Expecting property name enclosed in double quotes: line 1 column 41 (char 40) [decoder.py:raw_decode:353]], hook will be skipped

Likewise, a broken call spec yields something like this:

WARNING: Invalid argument specification for hook annoy (after parameter substitutions): {"cmd":"touch" /tmp/datalad_temp_test_basicsnscytin3/file1_annoyed","dataset":"/tmp/datalad_temp_test_basicsnscytin3","explicit":true} [Expecting ',' delimiter: line 1 column 16 (char 15) [decoder.py:raw_decode:353]], hook will be skipped

@mih (Member Author) commented Dec 10, 2019

> I think this is a clever approach. In addition to being flexible, the implementation is pleasing in terms of being a mostly independent layer on top of the current result handling.

Thx!

> On the other hand, it doesn't seem particularly pleasant to work with from a user perspective. Beyond the somewhat tedious task of formatting the JSON values for match and proc, this would require a decent investment from the user to think about how they should match results. It also makes me wonder how consistent or useful a surface our result records provide for latching onto. And it makes me worry that hooks will start matching very specific things about the results, and we're going to fear causing breakage even when changing minor details of the results.

I acknowledge all these concerns. But I also see the increased focus on result composition and placement as a chance to treat them more rigorously and thoughtfully, e.g. #3906

Re usability: I simply could not yet come up with something more convenient. But what about tweaking things right away to enable better forward compatibility:

.match -> .match-json
.proc -> .call-json (taking the above comment into account)

This gives us the freedom to implement support for different/simpler approaches in the future. I guess it is reasonable to assume that any approach would need to specify a criterion and a call specification.

> At any rate, given that this is non-intrusive and seems like it could be quite powerful, I'm for going forward with it and seeing how it plays.

Cool, thx!

Perhaps a word on the concrete motivation to implement this now, and like this. The use case is installing subdatasets and getting file content in a YODA dataset scenario. Say we install a YODA dataset as a throwaway clone in an HPC environment. We do not care much about data safety, but we do care about performance. Hence we want to turn on as much performance tuning as possible (reckless-install, ...). Additionally, we cannot have symlinks for files, because the particular analysis code cannot handle them. At the same time, there are real-world constraints that we need to deal with (unlock blows up storage demands, annex.thin needs v7, v7 adjusted branches have conceptual issues for YODA #3818).

With this feature implemented, I can configure the root dataset clone to arbitrarily tune any newly installed subdataset however I see fit, independent of the particular analysis. I can also make get unlock any file that is actually being obtained, as opposed to all files upfront, or putting the entire (sub)dataset into unlocked mode. And I can do all of that without actually altering any committed content, hence I am not overfitting the datasets themselves to a particular purpose or environment.
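For illustration, the get-triggered unlocking described above might be configured along these lines. This is a sketch only: the hook name and the exact result fields matched are assumptions, and it uses the .match/.proc key names from the PR description (the .match-json/.call-json renaming discussed above may supersede them):

```shell
# scratch repo so the commands below are runnable; in practice this
# would be the root dataset clone
repo=$(mktemp -d) && git -C "$repo" init -q

# hypothetical hook: unlock every file that `get` actually obtained
# (the matched result fields are an assumption for this sketch)
git -C "$repo" config datalad.result-hook.unlock-fetched.match \
  '{"action": "get", "type": "file", "status": "ok"}'
git -C "$repo" config datalad.result-hook.unlock-fetched.proc \
  'unlock {{"dataset": "{dsarg}", "path": "{path}"}}'

# read the stored match spec back
git -C "$repo" config datalad.result-hook.unlock-fetched.match
```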

@mih (Member Author) commented Dec 10, 2019

OK, I'll be bold and merge this in a few min.

@mih merged commit fa39d2f into datalad:master Dec 10, 2019
@mih deleted the nf-resulthooks branch December 10, 2019 11:35
@adswa (Member) commented Dec 10, 2019

Thanks for the comprehensive comments in this PR! I am adding a handbook section on this new feature here.
