Support callback dependencies #2378

Open · mdekstrand opened this issue Aug 7, 2019 · 45 comments
Labels: feature request, p3-nice-to-have, question, research

@mdekstrand

Supporting 'callback' dependencies and outputs (for lack of a better term) would enable a number of interesting possibilities for using DVC to control processes that store data outside of DVC's file access, without requiring DVC to support an ever-growing set of remote access protocols.

This would generalize what is possible with HTTP outputs and dependencies.

An example of what this could look like in a DVC file:

deps:
- cmd: python step-status.py --my-step
  md5: <checksum>

Instead of consulting a file, as with path:, DVC would run the specified command (which should be relatively quick) and compute the MD5 hash of its output. That command could do whatever it needs to in order to get the data status.

My specific use case is using DVC to control a large data import process that processes raw data files, loads them into PostgreSQL, and performs some additional computations involving intermediate files. I would implement a script that extracts data status from the PostgreSQL database so that DVC can check whether a step is up-to-date with the actual data currently in the database. I could implement this with HTTP dependencies and something like PostgREST, but that would introduce additional infrastructure requirements for people using the code (namely, a running PostgREST server).
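To make this concrete, here is a minimal sketch of what such a status command might look like for this use case. Everything in it is illustrative rather than part of the proposal: the psycopg2 driver, the connection settings, and the import_log table are all assumptions.

#!/usr/bin/env python
# Hypothetical step-status.py: print a stable summary of database state.
# DVC would checksum whatever this prints; if the summary changes, the
# dependency is considered changed.
import sys

import psycopg2  # assumed client library; any driver would do

def main(step):
    # Connection parameters are illustrative only.
    with psycopg2.connect(dbname="bookdata") as conn:
        with conn.cursor() as cur:
            # A cheap, stable summary of the state this step depends on.
            cur.execute(
                "SELECT count(*), max(id) FROM import_log WHERE step = %s",
                (step,))
            count, max_id = cur.fetchone()
    # Print deterministically so the checksum is stable across runs.
    print(f"step={step} rows={count} max_id={max_id}")

if __name__ == "__main__":
    main(sys.argv[1].lstrip("-"))  # invoked as: python step-status.py --my-step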

@mdekstrand

This is probably one particular instantiation of #1577.

@efiop commented Aug 8, 2019

Hi @mdekstrand! We currently have so-called "callback stages" that are run every time. With those, you could do something like

dvc run -o marker python step-status.py --my-step
dvc run -d marker -d ... -o ... mycommand

This way, dvc would always run marker.dvc, and if it changes the marker file, that will trigger reproductions down the pipeline. Would something like this work for you?

@mdekstrand

Almost, but not quite. The problem is forcing ordering between generating the marker file and whatever step came before it to populate the database with the content that step-status checks. As soon as I make the callback stage depend on another stage to force ordering, it ceases to be a callback stage. For example:

dvc run -o step1-out.txt -d ... python step1.py
dvc run -o step1-check.txt -d step1-out.txt python step-status.py --my-step
dvc run -d step1-check.txt -d step1

Without this, running a repro might run the step-status command before the data it requires is ready.

@efiop commented Aug 8, 2019

@mdekstrand Thank you for the explanation! Indeed, our current callback stage is not very useful in that scenario. What about an explicit flag, something like always_reproduce: true, that would make an arbitrary stage always run on repro? Would that be suitable in your scenario? I'm asking because we actually have plans to introduce it (#1407) to replace the current fragile no-deps assumption.

As to your idea with

deps:
- cmd: python step-status.py --my-step
  md5: <checksum>

it seems like we don't really need the md5 of the output; we could simply use the return code of that command as an indicator, e.g. <0 = error, 0 = didn't change, >0 = changed.

@efiop added the "feature request" label Aug 8, 2019
@mdekstrand

The always_reproduce: true solution would make the desired functionality possible, I believe, but the dependency graph would be cluttered with status steps.

I don't think I like the return code option, though, because then the status script must know how to compare the current state against the previous state. If DVC checksums the script's output, then all the script needs to do is emit a stable summary or description of the current state, and DVC's existing logic can take care of determining whether that represents a change.

@efiop commented Aug 10, 2019

@mdekstrand great point! In that case, what if we make the command return the checksum itself through stdout, instead of us computing the md5 of its output? That has the potential of being used not only for dependencies but also for outputs, as a cheap alternative-checksum plugin. There are a lot of things to consider with it, though.
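A minimal sketch of that contract, assuming (hypothetically) that the command prints a hex digest on stdout and DVC stores the digest verbatim:

# Hypothetical dependency command that reports its own checksum on stdout.
import hashlib

def state_checksum(summary):
    # DVC would store and compare this digest instead of computing an
    # md5 of the command's full output itself.
    return hashlib.md5(summary.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    # In a real script the summary would be queried from the database.
    print(state_checksum("step=my-step rows=1000 max_id=4242"))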

@shcheklein added the "question" and "research" labels Aug 10, 2019
@shcheklein

Just a temporary workaround that comes to mind. To make an intermediate stage effectively a "callback" stage, we can make it depend (along with other things, like the DB upload pipeline) on an artificial callback stage that, for example, just dumps the current timestamp.

We can even reuse this dummy callback stage everywhere to make any number of stages always reproducible.
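A sketch of such a dummy stage, with the dvc run wiring shown in comments; all file and command names here are illustrative:

# Hypothetical dump_timestamp.py for an artificial callback stage.
# Wire-up (illustrative):
#   dvc run -o now.txt python dump_timestamp.py   (no deps, so it always runs)
#   dvc run -d now.txt -d data.csv -o out.csv mycommand
import time

with open("now.txt", "w") as f:
    # A fresh timestamp each run makes now.txt always change, which
    # forces every stage depending on it to be considered stale.
    f.write(str(time.time()))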

I love the idea of cmd dependencies. It's simple to implement and solves a very good use case in a neat way.

@dmpetrov @Suor @pared @MrOutis your thoughts?

@pared commented Aug 12, 2019

Seems like a good feature request. I don't see too many problems with implementation at first sight. Probably some graceful exception handling will be required (when the status check is performed on non-existing data). Also, I think some kind of communication with the user might be a good idea, like "Your executable dependency returned: X, do you want to proceed?", in order to avoid a situation where the check returns an error and we assume it was the desired output.

@Suor commented Aug 12, 2019

I am in favor of cmd dependencies; this looks generic and could be used as simply as

cmd: psql -U user -c 'select id from some_table order by id desc limit 1'

and many others like it.

We need to come up with a command-line interface for it, though. How should this look in dvc run?

@Suor commented Aug 12, 2019

How about dvc run --dep-cmd="psql -U ..." ...?

@efiop added the "p1-important" label Aug 13, 2019
@efiop self-assigned this Aug 13, 2019
@ghost commented Aug 13, 2019

How about dvc run --dep-cmd="psql -U ..." ...?

@Suor, the only problem I'm seeing with this one is when using multiple dependencies:

dvc run -d script.py -d database_dump.csv --dep-cmd="psql -U ..."

How would you know which cmd corresponds to which dependency?

@Suor commented Aug 13, 2019

Cmd is a separate dependency; it doesn't have a path and doesn't correspond to anything.

@efiop commented Aug 13, 2019

@Suor I'm actually not sure about that. We need the path to build the DAG properly. So what we need is something like

cmd: mycmd
md5: mymd5
path: path/to/dep

@Suor commented Aug 14, 2019

But there is no path; the DAG should not include the cmd dep, or this should be special-cased somehow.

@pared commented Aug 14, 2019

Shouldn't command dependency scripts be handled by SCM?

@efiop commented Aug 14, 2019

@Suor My thinking was that mycmd would analyze path/to/dep inside, just as an alternative to our md5-based checksums. Not sure mycmd without a dep path is any good. @mdekstrand What are your thoughts?

@pared They should. I didn't mean that path/to/dep should be a path to the script that we are running, but rather to the dep that it is analyzing.

@Suor commented Aug 14, 2019

@efiop path has no meaning here, so it should be neither in the stage file nor in the command-line UI. I don't understand what "analyze path/to/dep inside" even means; there is no path. It could be a call to a database, some API request, whatever.

@efiop commented Aug 14, 2019

@Suor The command mentioned by the OP is analyzing the db, so this stage should depend on the db; the command is only helping us judge whether that dependency has changed or not.

@shcheklein

@efiop tbh, I also don't understand where path comes from. Can you give an example of when we would need it, and what file that "command dependency" would depend on?

@pared commented Aug 15, 2019

@shcheklein In this use case you could, for example, have a has_changed script that a command dependency would invoke like --cmd-dep="python has_changed.py some_file". So some_file would be a dependency of a dependency.

I guess we can say that we are already doing that: our current -d dependency is effectively --cmd-dep="md5sum dependency".

@shcheklein

@pared thanks for the example! (Btw, I would say it depends on has_changed.py the same way as on some_file in this specific case.)

I still don't quite understand what exactly you are suggesting, though. Could you please describe the logic, CLI, and high-level implementation you have in mind?

Like: we run dvc run so-and-so, it does this-and-that, and it generates a DVC-file with these fields.

@Suor commented Aug 15, 2019

@pared if you need a dependency of a dependency, then you should create a separate stage for that command and set that script as a dependency.

In general, path has no meaning here; this could easily be a one-liner using the psql/mysql/... CLI, or even date +%F to execute a stage no more often than once a day.

@shcheklein

@Suor you don't even need a separate stage. You can make it a second regular file dependency in that same stage, as far as I understand. But maybe I'm still missing something.

efiop added a commit to efiop/dvc that referenced this issue Sep 9, 2019
This is an extension of our callback stages (no deps, so always
considered as changed) for stages with dependencies, so that these
stages could be used properly in the middle of the DAG. See attached
issue for more info.

Related to iterative#2378
@efiop commented Sep 9, 2019

@mdekstrand --always-changed was released in 0.59.2, please give it a try and be sure to let us know how that works for you 🙂 Thanks for all the feedback!

@efiop added the "p3-nice-to-have" label and removed the "p1-important" label Sep 10, 2019
@mdekstrand

@efiop Will do as soon as I can! It's been a busy 2-3 weeks.

@mdekstrand commented Oct 11, 2019

@efiop I have started making our book data tools use DVC with --always-changed, and it is mostly working. See https://github.com/BoiseState/bookdata-tools/tree/dvc; I describe the DVC setup at the bottom of the README.

However, there is a semantic hole in the --always-changed strategy that pluggable deps/outs would fill. While the --always-changed dependency lets me flag downstream steps for re-run, I have no way to flag that the Big Database Import Job's output no longer matches what the DVC file remembers it to be, and that the job should therefore be re-run. In the current setup, I have 2 stages for an import:

  • import.dvc runs the importer, depending on its inputs and producing import.transcript, a file describing the import run.
  • import.status.dvc depends on import.transcript and is --always-changed; it writes out import.status containing a stable summary of the current contents of the database.
  • Anything needing the results of this import, such as process.dvc, depends on import.status.

If the database is out of date or has changed somehow, import.status will change, so process.dvc will rerun. However, an import.status change means that import.dvc's (conceptual) outputs are out of date, and it should also be rerun. Right now there is no way to do this with --always-changed without introducing circular dependencies. If I make a copy of import.status a dependency of import.dvc, then it will run once against the empty state and then run again; not at all what I want either.

The only way I see to fix this is to introduce a mechanism where an output is computed by a user-defined process that gets invoked every time DVC needs to check the output's status. This computation should, of course, be very cheap. This is exactly the scenario allowed with e.g. S3 remote outputs/dependencies: if my code were storing data on S3 instead of PostgreSQL, I could give the import job an S3 output and the process job an S3 input, and everything would be fine.

As discussed upthread, there are serious problems with trying to wire together callback dependencies and outputs. An alternate approach would be to make external dependencies pluggable: if I can register a handler for bdstatus:// URLs, and implement them by querying the database for status information, then import.dvc could have an output of bdstatus://import, and process.dvc could depend on bdstatus://import, and URLs do their normal job of creating the out/dep link needed to order the dependency graph. There would of course be some security concerns; dvc shouldn't just run URL handlers in the local repository when you call dvc status. But a custom URL solution seems like it could be very promising for plugging this hole.

@efiop commented Oct 14, 2019

@mdekstrand Sorry for the delay.

Why can't you just combine your import.dvc and import.status.dvc and use --always-changed for the combined stage? You could also check the remote db status there, without the need for bdstatus plugins.

@mdekstrand commented Oct 14, 2019

There are a couple of reasons why I'm not fond of that option:

  • I need to re-implement 'am I up to date?' logic myself so the import can auto-abort, instead of letting dvc take care of that so long as I can produce stable descriptions of state.
  • My pipeline looks like this.

The import itself is actually in the middle of 3-4 steps. The general flow looks like:

  1. Set up schema(s)
  2. Import data
  3. Index imported data
  4. Integrate

There is some back-and-forth; the indexing of some data sets depends on the integration of others. This repository imports 6 different data sets in our current active set (we plan to add at least 2 more), in different formats (CSV, JSON, MARC XML, soon RDF). DVC is giving me 95% of what I need to wire it all together, avoid unnecessary reruns, and debug the dependency graph. I'm currently making it work by ensuring that I have values that propagate through .status files to trigger reruns when the code is run against a new database. It mostly works; it's a bit ugly.

@mdekstrand

To reply to @Suor in #2531, for a piece of the discussion that feels more relevant here:

A general takeout from all of this - we should define where dvc responsibility ends, like say non-file non-dvc managed deps. Trying to stretch our responsibility over those limits means bringing together mutually contradictory objectives and won't lead to anything good.

This is reasonable.

However, DVC has already extended some of this responsibility with external dependencies / outputs; it just limits them to S3, Google Cloud, Azure, and HTTPS. That was actually my inspiration for even thinking about this: I saw that feature as doing exactly what I needed, except limited to a closed set of sources.

So in my mind, that is what this issue is about: allowing users to extend the set of external dependency & output sources. I thought callbacks might be an easier way to do that than plugins, but am no longer so convinced.

I could implement a small web server that exposes status information over HTTP, and use URLs pointing to it as external dependencies and outputs. I thought about that. It does, however, increase the infrastructure requirements for users of the code in which I would use this feature (and that code is intended for external consumption).

@efiop commented Oct 20, 2019

However, there is a semantic hole in the --always-changed strategy that pluggable deps/outs would fill. While the --always-changed dependency enables me to flag downstream steps for re-run, I do not have a way to flag that the Big Database Import Job's output does not match what the DVC file remembers its output being, and therefore it should be re-run. In the current setup, I have 2 stages for an import:

@mdekstrand Would it be possible to check the db state without actually importing it? I had that scenario in mind when we were talking about this issue and assumed that you have a way of knowing the db version without a long-running import. You've mentioned an http server, so I assume it would be able to tell the db version without importing, right? Why can't you have a local script that does that?

@Suor commented Oct 20, 2019

However, DVC has already extended some of this responsibility with external dependencies / outputs; it just limits them to S3, Google Cloud, Azure, and HTTPS. That was actually my inspiration for even thinking about this: I saw that feature as doing exactly what I needed, except limited to a closed set of sources.

S3 and friends are files in all the senses we need. We can checksum them, commit to cache and checkout, push and pull, which is the cycle of reproducibility dvc provides. This is not so with some external db.

@mdekstrand

@efiop wrote:

Would it be possible to check db state without actually importing it?

Yes. If I implemented the HTTP server, it would either return a 404 when asked for the status of an import operation that has not run, or it would return an empty result that would not match the result from any successful import. That would be enough to let DVC know that the output does not exist and should be recreated. My ideal end state actually depends on this: if external outputs are pluggable, DVC needs to be able to attempt to retrieve a stage's output/status blob when the stage has not yet run, to detect that it is missing or contains unexpected content.

I am currently checking initial state, and then arranging for that to propagate through the state files in the rest of my operations.

@Suor:

S3 and friends are files in all the senses we need. We can checksum them, commit to cache and checkout, push and pull, which is the cycle of reproducibility dvc provides. This is not so with some external db.

Yes. However, as I understand it, the individual scripts are responsible for pushing blobs to S3, and DVC simply checks those blobs. I don't see a substantial, meaningful difference between "python script pushes $BLOB to S3" and "python script pushes a bunch of data to PostgreSQL that results in stable $BLOB".

A few wrap-up comments:

  • I no longer believe that callback dependencies, as originally specified, are the correct solution, because of the issues connecting inputs to outputs.
  • I believe making external dependencies pluggable in some way has a lot of potential value. I may implement the web server approach to prototype how I would use pluggable external dependencies.
  • Our code is currently working, and I think it is in a state such that re-running against a fresh or different database will correctly trigger re-runs. Therefore, --always-changed is helping, and is allowing us to accomplish our goals, albeit in a way that has a messier dependency graph than I would like.
  • We are perhaps somewhat abusing DVC outside its intended use case for the project where this is coming up. I've come to love DVC, and it solves many problems my research group faces; I am using it here in part because there's nothing better, and in part because even if there were another tool better suited to this specific use case (maybe doit?), software stack consistency helps with training and onboarding research assistants.

@mdekstrand

I have hacked up an ugly version of this feature as a monkey-patch to DVC in our book data tools: https://github.com/BoiseState/bookdata-tools

What I do is define a custom remote, with corresponding dependency and output classes, that keys off the pgstat:// URL scheme to get status for import stages from the database. It's working quite beautifully, so long as I remember to make sure every pgstat output is uncached. The crucial code is in bookdata/dvcpatch.py; the wrapper script scripts/dvcw.py installs the patch and calls DVC's main().

I would rather not have to monkey-patch, although the patch is not difficult. Pluggable remotes would help a lot. There is the problem that dvc status probably shouldn't run code in the repository you've checked out, for security reasons. This could be addressed by having DVC bail if it detects that a remote plugin is needed to compute status, and having a command like dvc trust that puts in a marker flag to tell DVC that the local repository is OK. .dvc/config.local seems like the right place to put that marker, if there is a way to restrict config settings to specific sources (.dvc/config should not be allowed to set the trusted flag, obviously).

@efiop commented Mar 12, 2020

@mdekstrand Whoa, that is neat! Do you think pgstat could be useful outside of your project? Would it make sense to include it in the core dvc?

@mdekstrand

The details of pgstat are pretty specific to these kinds of data ingest and integration pipelines, although they aren't necessarily specific to this one. There is a lot of logic sprinkled throughout the import code that makes it work. It would be nice to make it generally reusable, but I probably don't have time in the near future. Pulling it into DVC would probably make DVC more dbt-like (referencing #2945); although I have never used dbt, it is arguably a better fit for this project, but adopting it would come at the substantial expense of increasing the number of tools my team must be familiar with (we use DVC for other projects where it is clearly a fit, including the experiments that actually use this data).

What I think would be trivially generalizable to include in DVC is the idea of custom remotes & external deps/outs for this kind of thing. If I could put in .dvc/config:

[custom-external-remotes]
pgstat = bookdata.pgstat

and could then write a bookdata.pgstat module that implements exists, get_file_checksum, and maybe some of the other methods (optionally, of course; exists, get_file_checksum, and remove are the only ones I actually needed to implement), and have DVC automatically wire that up (pending marking the repository as trusted, as per my previous comment), it would make this kind of thing trivial to support with whatever logic a specific project requires.
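To illustrate, a sketch of what such a module might look like under this (purely hypothetical) interface. The function names exists, get_file_checksum, and remove come from the proposal above; the psycopg2 usage and the import_status table are assumptions.

# Hypothetical bookdata/pgstat.py handling pgstat:// URLs.
import hashlib
from urllib.parse import urlparse

import psycopg2  # assumed driver

def _step_status(url):
    step = urlparse(url).netloc  # pgstat://import -> "import"
    with psycopg2.connect(dbname="bookdata") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT status FROM import_status WHERE step = %s",
                        (step,))
            row = cur.fetchone()
    return row[0] if row else None

def exists(url):
    # Does the database have recorded status for this step at all?
    return _step_status(url) is not None

def get_file_checksum(url):
    # Checksum a stable summary of database state for this step.
    return hashlib.md5(str(_step_status(url)).encode("utf-8")).hexdigest()

def remove(url):
    # Forget the recorded status so DVC treats the output as missing.
    with psycopg2.connect(dbname="bookdata") as conn:
        with conn.cursor() as cur:
            cur.execute("DELETE FROM import_status WHERE step = %s",
                        (urlparse(url).netloc,))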

@shcheklein

@mdekstrand @efiop I wonder if this discussion here is relevant? (Please read to the very end; it started from Python function dependencies and is now circling around custom functional filters.)

@efiop commented Mar 15, 2020

@shcheklein It is relevant. It is also relevant to #1577.

@mdekstrand As to .dvc/config, it seems like there should be a better way than a custom remote. More along the lines of a general custom plugin that would recognize things and deal with them accordingly. And that is where #1577 and #3439 come to mind. Please give them a look; maybe something like that would be a suitable solution. Maybe instead of .dvc/config we would introduce something like .dvc/plugins/ with your python plugins... But that is just off the top of my head; we need to reconsider all of these together, because it looks like it all has the same general solution.

@mdekstrand

@efiop A general plugin interface would be quite helpful, and I'm fine with .dvc/plugins instead of .dvc/config.

Doing this will require well-defined extension points with documented, stable, and hopefully minimal interfaces. It will probably also require the DVC config schema to be extensible, so plugins can be configured.

A first version that loads and runs plugins that then monkey-patch DVC would allow things to get started (and I could remove my DVC wrapper). There is still the security issue, though (DVC shouldn't run plugins until a repository is trusted).

@mdekstrand

Just FYI, DVC 2 broke my monkey-patch, and it wasn't clear how to add it back, so I have reverted to the two-step mechanism with .status files, which is ugly and somewhat slow (invoking an external process for each status check is slower than an in-process query).

I would still very much like to see support for custom URL schemes for external outputs and dependencies. It would solve this problem rather cleanly.

@michaelmior

Just to add another use case: I started with several stages depending on the entire source directory. This is cumbersome because when unrelated files change, I still need to re-run the stages (or explicitly commit what I know hasn't changed to avoid this). I hacked together something based on pydeps that analyzes the dependencies of the scripts I use for my stages and updates my dvc.yaml to list only the necessary dependencies.

However, this now requires that I run this script occasionally to make sure my dependencies are up to date. With a callback dependency, I could just run this script to check only the necessary dependencies. Since the dependency graph is somewhat complicated to compute, I could even have another stage that depends on the whole source directory and calculates and caches the dependency graph, then just check the cached graph in other stages with a callback dependency.
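For what it's worth, a stdlib-only sketch of the kind of check such a callback dependency could run; the train.py stage script is assumed, and a real setup would more likely reuse the pydeps output:

# Hypothetical callback: print the local source files a stage script
# imports, so DVC can checksum the list instead of the whole directory.
import os
from modulefinder import ModuleFinder

def local_deps(script):
    finder = ModuleFinder()
    finder.run_script(script)  # statically analyzes the script's imports
    here = os.getcwd()
    return sorted(
        mod.__file__
        for mod in finder.modules.values()
        if mod.__file__ and os.path.abspath(mod.__file__).startswith(here))

if __name__ == "__main__":
    for path in local_deps("train.py"):
        print(path)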
