
NF: Result hooks #3903

Merged: 15 commits, Dec 10, 2019

Conversation

@mih (Member) commented Dec 6, 2019

This aims to be a more flexible alternative to --proc-pre/post and its proposed successor, command hooks (#3264).

The key idea is that we have DataLad's results that all pass through the main event loop, and we can define ad-hoc hooks that run custom actions whenever a matching result is observed.

Key differences to what we already have:

  • can act more than once per command execution
  • no "pre" action anymore
  • informed by the actual result itself
  • runs dataset procedures, but also any proper datalad command

To define a hook, two config variables need to be set:

  • datalad.result-hook.<name>.match
  • datalad.result-hook.<name>.proc

where <name> is any Git config compatible identifier.

match contains a JSON-encoded dict that a result is matched against in order to decide whether the respective hook should run. It can contain any number of keys. Each key's value is tested against the corresponding value in the result; the hook is executed only if all tests pass. In addition to == tests, in, not in, and != tests are supported. The operation is selected by wrapping the test value into a list: the first item is the operation label ('eq', 'neq', 'in', 'nin'), the second is the test value (or set of values). Example:

{"type": ["in", ["file", "directory"]], "action": "get", "status": "notneeded"}

proc specifies what the hook execution comprises. Any datalad command is suitable (which includes run_procedure). The value is a string whose first word is the name of the datalad command to run (in Python notation). The remainder of the string is a JSON-encoded dict with keyword arguments for the command execution. Unlike match, string substitution is supported: any key from a matching result can be used as a placeholder that is replaced with the respective value from the result dict. In addition, a dsarg key is supported that expands to the dataset argument given to the command whose results are being processed by the eval_func decorator.
Because the string substitution uses Python's format(), literal curly braces have to be protected by doubling them. Hence an example setting could look like:

unlock {{"dataset": "{dsarg}", "path": "{path}"}}

or

run {{"cmd": "touch {path}_annoyed", "dataset": "{dsarg}", "explicit": true}}

Hook evaluation obviously slows processing, especially given the location in the code path (eval_func). The code is trying to minimize this impact. However, the lookup of potential hooks in the config represents an unconditional additional cost.

That said, I consider this an extremely powerful mechanism that can be used to achieve custom setups without having to add features to the implementations of particular commands. So in summary I think it is worth the cost.

For more info, please see the test inside.

Benchmarks:

Our standard benchmarks show no impact (not a surprise, not much happening in them). So I ran tests that generate a lot of results (saving a dataset with 10k tiny files):

# setup: create 10k tiny files spread over a nested directory hierarchy
# (each path is derived from a UUID with its dashes turned into slashes)
for i in $(seq 10000); do s=$(uuid | tr '-' '/'); mkdir -p ${s:5:19} && echo $s > ${s:5}; done;

# no hooks defined, this PR
datalad save  27.97s user 11.27s system 101% cpu 38.800 total

# no hooks defined, master 6031944e8a7770ec59389b6d7ec02d122cfbcbc3
datalad save  28.72s user 11.56s system 84% cpu 47.818 total

Looking forward to your feedback @datalad/developers

TODO:

  • string substitution with Windows paths leads to invalid JSON. No idea how to deal with that yet
  • investigate and potentially adjust the timing of installation success results for datasets. It seems as if further processing is being performed on the dataset (content) after the result is yielded (see disabled test inside). Now addressed in BF: Make sure to not issue an install-ok result, before the repo state is final #3906 (merged into this PR too)

mih added a commit to mih/datalad that referenced this pull request Dec 7, 2019
mih added a commit to mih/datalad that referenced this pull request Dec 8, 2019
codecov bot commented Dec 8, 2019

Codecov Report

Merging #3903 into master will increase coverage by 9.69%.
The diff coverage is 78.76%.


@@            Coverage Diff            @@
##           master   #3903      +/-   ##
=========================================
+ Coverage   71.01%   80.7%   +9.69%     
=========================================
  Files         272     274       +2     
  Lines       36107   36220     +113     
=========================================
+ Hits        25640   29233    +3593     
+ Misses      10467    6987    -3480
Impacted Files Coverage Δ
datalad/core/local/tests/test_resulthooks.py 100% <100%> (ø)
datalad/config.py 97.42% <50%> (+21.87%) ⬆️
datalad/core/local/resulthooks.py 63.93% <63.93%> (ø)
datalad/interface/utils.py 88.63% <93.33%> (+18.86%) ⬆️
datalad/downloaders/tests/test_http.py 58.39% <0%> (-2.19%) ⬇️
datalad/support/tests/test_annexrepo.py 96% <0%> (-0.4%) ⬇️
datalad/tests/test_utils.py 95.84% <0%> (-0.27%) ⬇️
datalad/interface/tests/test_unlock.py 98.07% <0%> (ø) ⬆️
... and 78 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b45ca69...67a43bf.

@bpoldrack (Member) commented:
Generally looks very nice indeed.

I'm struggling a bit ATM to get my head around the security implications and what part of them might need addressing where. In principle we have a similar issue with procedures anyway, since you can "catch them" by updating a dataset. And we can't completely avoid the issue without losing substantial functionality.
However, if we allow code execution via a committed config, we kinda make it worse. So this is not an objection to this PR and approach (I do think we need it), but I think it raises the we-need-to-address-this level. At least we should give users a defense, like ignoring .datalad/config wrt procedures and hooks.

@mih (Member Author) commented Dec 8, 2019

Thanks for the review!

We already allow this kind of config-based execution with proc-pre/post for individual commands. So in that respect this PR doesn't add a new threat.

@bpoldrack (Member) commented:
Yes, agree. Just raises the priority we need to think about this with.

@mih (Member Author) commented Dec 8, 2019

Maybe we should use a new ConfigManager instance within eval_func that does not read .datalad/config.

However, I would do that in a separate PR, as it affects any and all types of hooks.

mih added a commit to mih/datalad that referenced this pull request Dec 9, 2019
i.e. to not read any configuration that is committed in a dataset.
This is a needed mode of operation in any situation where it is
not safe to have configuration pulled from elsewhere affect local
operation, e.g. the definition and execution of hooks (see
datalad#3903)
@mih (Member Author) commented Dec 9, 2019

I proposed #3907 to address the security concerns.

@bpoldrack self-requested a review December 9, 2019 12:27
@mih (Member Author) commented Dec 9, 2019

Cool, thanks. When #3907 is merged, I'll fix the resulting conflicts with this one, and proceed to merge, once the tests have passed again. Unless @kyleam comes to a different conclusion ;-)

@kyleam (Contributor) commented Dec 9, 2019

> Unless @kyleam comes to a different conclusion ;-)

If you'd like to wait, I'll find time to review this today, but it's of course fine if you want to proceed with the merge.

Without having looked into this in any detail, I'll just say that I'm happy @bpoldrack brought up the security concerns, and that that's already led to a PR. I should have thought of those concerns when the datalad.<name>.proc-{pre,post} stuff was introduced. Even worse, I was thinking about git hooks/config and security somewhat recently when reading this git thread, and it did not occur to me that---given that untracked .git/{config,hooks} raise security concerns---tracked values of .proc-{pre,post} (and probably other stuff related to procedures) should be re-evaluated carefully.

@mih (Member Author) commented Dec 9, 2019

I'll be more than happy to wait. Thanks!

@mih (Member Author) commented Dec 9, 2019

Working on resolving the conflict...

mih added 10 commits December 9, 2019 17:33
Otherwise Windows paths, and Path object instances in general, will crash the machine.
@kyleam (Contributor) left a comment
I think this is a clever approach. In addition to being flexible, the implementation is pleasing in terms of being a mostly independent layer on top of the current result handling.

On the other hand, it doesn't seem particularly pleasant to work with from a user perspective. Beyond the somewhat tedious task of formatting the JSON values for match and proc, this would require a decent investment from the user to think about how they should match results. It also makes me wonder how consistent or useful a surface our result records provide for latching onto. And it makes me worry that hooks will start matching very specific things about the results, and we're going to fear causing breakage even when changing minor details of the results.

At any rate, given that this is non-intrusive and seems like it could be quite powerful, I'm for going forward with it and seeing how it plays.

datalad/core/local/resulthooks.py (inline review comments, resolved)
'Incomplete result hook configuration %s in %s' % (
h[:-6], cfg))
continue
sep = proc.index(' ')
@kyleam (Contributor) commented:

Would there be a downside to being more lenient and splitting on the first whitespace (proc.split(maxsplit=1))? Solely from a readability standpoint, I'd find it nice to get rid of some of the indexing below.


Given that the proc and match values come from the user, it would be good to show a more helpful message when they aren't valid. For example, the .index() call above would fail with

[ERROR  ] substring not found [resulthooks.py:get_hooks_from_config:45] (ValueError) 

if the value didn't contain a space. The json.loads() is another case (and perhaps the more likely one when users are trying to put valid JSON into a git config value). And misformatted arguments would make it through to run_hook before causing issues, so that's yet another spot.

Given that these hooks are secondary to the main command, perhaps these failures should give a warning and continue on with the rest of result processing.
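The lenient parsing suggested here could look something like this (an illustrative sketch, not the actual patch; get_hook_call is a made-up name):

```python
import json
import logging

lgr = logging.getLogger('datalad.example')

# Sketch of the lenient parsing suggested above: split on the first
# whitespace instead of using .index(' '), and skip the hook with a
# warning instead of crashing on malformed JSON
def get_hook_call(name, proc):
    parts = proc.split(maxsplit=1)
    cmd = parts[0]
    try:
        args = json.loads(parts[1]) if len(parts) > 1 else {}
    except json.JSONDecodeError as e:
        lgr.warning(
            'Invalid argument specification for hook %s [%s], '
            'hook will be skipped', name, e)
        return None
    return cmd, args

print(get_hook_call('annoy', 'unlock {"path": "f1"}'))  # ('unlock', {'path': 'f1'})
print(get_hook_call('annoy', 'status'))                 # ('status', {})
print(get_hook_call('annoy', 'run {broken'))            # None (after a warning)
```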

@mih (Member Author) replied:

> Would there be a downside to being more lenient and splitting on the first whitespace (proc.split(maxsplit=1))? Solely from a readability standpoint, I'd find it nice to get rid of some of the indexing below.

Oh yes! Thanks for reminding me that we are in PY3 land now!! Done.

I also added the ability to have argument-less call specifications, and a smoke test for them.

> Given that the proc and match values come from the user, it would be good to show a more helpful message when they aren't valid. For example, the .index() call above would fail with
>
> [ERROR  ] substring not found [resulthooks.py:get_hooks_from_config:45] (ValueError)
>
> if the value didn't contain a space. The json.loads() is another case (and perhaps the more likely one when users are trying to put valid JSON into a git config value). And misformatted arguments would make it through to run_hook before causing issues, so that's yet another spot.
>
> Given that these hooks are secondary to the main command, perhaps these failures should give a warning and continue on with the rest of result processing.

A broken match spec now yields a warning like this:

WARNING: Invalid match specification in datalad.result-hook.annoy.match-json: {"type":["in", ["file"]],"action":"get",status":"notneeded"} [Expecting property name enclosed in double quotes: line 1 column 41 (char 40) [decoder.py:raw_decode:353]], hook will be skipped

Likewise, a broken call spec yields something like this:

WARNING: Invalid argument specification for hook annoy (after parameter substitutions): {"cmd":"touch" /tmp/datalad_temp_test_basicsnscytin3/file1_annoyed","dataset":"/tmp/datalad_temp_test_basicsnscytin3","explicit":true} [Expecting ',' delimiter: line 1 column 16 (char 15) [decoder.py:raw_decode:353]], hook will be skipped

@mih (Member Author) commented Dec 10, 2019

> I think this is a clever approach. In addition to being flexible, the implementation is pleasing in terms of being a mostly independent layer on top of the current result handling.

Thx!

> On the other hand, it doesn't seem particularly pleasant to work with from a user perspective. Beyond the somewhat tedious task of formatting the JSON values for match and proc, this would require a decent investment from the user to think about how they should match results. It also makes me wonder how consistent or useful a surface our result records provide for latching onto. And it makes me worry that hooks will start matching very specific things about the results, and we're going to fear causing breakage even when changing minor details of the results.

I acknowledge all these concerns. But I also see the increased focus on result composition and placement as a chance to treat them more rigorously and thoughtfully, e.g. #3906

Re usability: I simply could not yet come up with something more convenient. But what about tweaking things right away to enable better forward compatibility:

.match -> .match-json
.proc -> .call-json (taking the above comment into account)

This gives us the freedom to implement support for different/simpler approaches in the future. I guess it is reasonable to assume that any approach would need to specify a criterion and a call specification.

> At any rate, given that this is non-intrusive and seems like it could be quite powerful, I'm for going forward with it and seeing how it plays.

Cool, thx!

Perhaps a word on the concrete motivation to implement this now, and like this. The use case is installing subdatasets and getting file content in a YODA dataset scenario. Say we install a YODA dataset as a throwaway clone in an HPC environment. We do not care much about data safety, but we do care about performance. Hence we want to turn on as much performance tuning as possible (reckless-install, ...). Additionally, we cannot have symlinks for files, because the particular analysis code cannot handle them. At the same time, there are real-world constraints that we need to deal with (unlock blows up storage demands, annex.thin needs v7, v7 adjusted branches have conceptual issues for YODA #3818).

With this feature implemented, I can configure the root dataset clone to arbitrarily tune any newly installed subdataset however I see fit, independent of the particular analysis. I can also make get unlock any file that is actually being obtained, as opposed to all files upfront, or putting the entire (sub)dataset into unlocked mode. And I can do all of that without actually altering any committed content, hence I am not overfitting the datasets themselves to a particular purpose or environment.
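For illustration, the get-triggered unlocking described above might be configured along these lines. This is a sketch only: the hook name and the exact result fields matched are assumptions, and it uses the .match/.proc key names from the PR description (the .match-json/.call-json renaming discussed above may supersede them):

```shell
# scratch repo so the commands below are runnable; in practice this
# would be the root dataset clone
repo=$(mktemp -d) && git -C "$repo" init -q

# hypothetical hook: unlock every file that `get` actually obtained
# (the matched result fields are an assumption for this sketch)
git -C "$repo" config datalad.result-hook.unlock-fetched.match \
  '{"action": "get", "type": "file", "status": "ok"}'
git -C "$repo" config datalad.result-hook.unlock-fetched.proc \
  'unlock {{"dataset": "{dsarg}", "path": "{path}"}}'

# read the stored match spec back
git -C "$repo" config datalad.result-hook.unlock-fetched.match
```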

@mih (Member Author) commented Dec 10, 2019

OK, I'll be bold and merge this in a few min.

@mih merged commit fa39d2f into datalad:master Dec 10, 2019
@mih deleted the nf-resulthooks branch December 10, 2019 11:35
@adswa (Member) commented Dec 10, 2019

Thanks for the comprehensive comments in this PR! I am adding a handbook section on this new feature here.
