This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Reconfigurable pipelines #1462

Closed
dmpetrov opened this issue Dec 31, 2018 · 14 comments
Labels
feature request Requesting a new feature

Comments

@dmpetrov
Member

dmpetrov commented Dec 31, 2018

Many issues require reconfiguration of stages and even pipelines:

A concept of reconfigurable-stage should be introduced in DVC.

Open questions:

  1. how to pass config file (do we need multiple config files)?
  2. how to pass params (should we override config by the params or it is a separate concept)?
  3. how to specify input files?
  4. how to specify the output directory (gs1/)?
  5. how to specify an output file without an output directory (see ./output.p instead of gs1/output.p from the above)?
  6. how to make it work for pipelines?
  7. Should build cache (#1234, "repro: use build cache for deterministic stages") be part of the solution? It would allow caching intermediate results of reusable stages (if step1 is the same in a few pipeline instances).

UPDATE: #1214 might be also related to this issue.
UPDATE2: Add a quote from vern and open question 7.

@dmpetrov dmpetrov added the feature request Requesting a new feature label Dec 31, 2018
@prihoda
Contributor

prihoda commented Jan 2, 2019

Sounds great, I guess we should think about all the use cases and what the new functionality will bring compared to the existing solutions.

  1. Running an experiment that reproduces the whole pipeline with new parameters (probably in a separate branch).
    • Current solution A: Manually edit the parameters in each stage file and reproduce the pipeline.
    • Current solution B: Use source config.sh before each parametrized command. Input and output paths cannot change.
    • Possible improvement: A global DVC config file loaded automatically before each command (#1416, "pipelines: parametrize using environment variables / DVC properties"). Parameters could also be incorporated into input and output paths.
  2. Keeping track of multiple outputs generated by the same command and different parameters (in the same branch).
    • Current solution A: Duplicate stages with the same command, just different parameters. The stages can be created manually or using a for cycle (for i in {1..10}; do dvc run cmd --option $i; done). Input and output wildcards can be used, although only when first running the command - they will be hardcoded in the stage file after that.
    • Current solution B: Single stage containing a for cycle. Input and output wildcards can be used, and if they are contained inside a folder, they won't have to be hardcoded, since the input/output dependency will only specify the folder (useful together with #1214, "run/repro: add option to not remove outputs before reproduction", to solve #1119, "How to manage repetitive dvc run commands (like unpacking of many zip files)?").
    • Possible improvement 1: Support input and output wildcards that would allow solution B to work even with files in the same folder. The stage command could be run once for all inputs (unzip *.zip - possible even when multiple input/output paths contain wildcards) or separately for each file which would be provided as a variable (unzip $INPUT - only possible if only one input/output path contains wildcards). Providing a corresponding output path would probably have to be handled by the user (png2jpg $INPUT ${INPUT/.png/.jpg}).
      • Would have to use quotes when executing: dvc run -d 'input*.zip' -o 'output*.zip' 'unzip *.zip'. Might be solved using a prompt (#1415, "Get run command from stdout").
      • Might be tricky to implement, but should be possible by taking all the inputs/outputs matching the wildcard and handling them as an imaginary folder (which can be empty or contain any number of files). Just checking output path duplicates would be harder, since you would have to make sure no path can match both wildcards.
    • Possible improvement 2: Support specifying parameters as part of the stage file to avoid for cycles (#1018, "dvc: consider introducing build matrix").
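The per-file variable pattern from "possible improvement 1" can be sketched in plain bash, independent of DVC. `png2jpg` is hypothetical, so the loop only echoes the conversion each iteration would perform:

```shell
# Create a couple of dummy inputs to drive the wildcard.
mkdir -p wildcard_demo
touch wildcard_demo/a.png wildcard_demo/b.png

for INPUT in wildcard_demo/*.png; do
  # ${INPUT/.png/.jpg} derives the output path from the input path,
  # as in the png2jpg example above.
  OUTPUT="${INPUT/.png/.jpg}"
  echo "png2jpg $INPUT $OUTPUT"
done
```

This is exactly what a "DVC native" per-file wildcard would have to automate: expand the glob, bind each match to a variable, and let the user derive the corresponding output path.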

@dmpetrov
Member Author

dmpetrov commented Jan 4, 2019

@prihoda I agree with both your statements, but I need to clarify a couple of things.

First, I didn't get this part: "Parameters could also be incorporated into input and output paths." Does it mean that you would like to change inputs and outputs via parameters? To my mind, inputs, outputs, and parameters are different concepts, and DVC should provide an ability to change them separately, like

dvc run-reconf my_reconf_stage --input raw/input.csv --output output.p --config myconf1.yaml --param alpha=0.57 --param steps=400

What do you think about this?

Second, "Possible improvement 2" looks better to me, since our custom wildcards will have more limitations (compared to user scripts/loops) and it might be tricky to implement, as you said. Does the code above (dvc run-reconf my_reconf_stage ...) look like your "Possible improvement 2"?

PS: I think I need to clarify the "#1119 repetitive commands" part. It is not about loops (loops are fine and probably even better than custom wildcards); it is about running a reconfigurable stage (let's imagine unzip is a reconf stage) where we would like to have the output of this stage in any place of our project (not only in a specific stage directory).

@dmpetrov
Member Author

dmpetrov commented Jan 4, 2019

I'd add one more use case to the requirement list that might be extremely useful.

  • we should provide an ability to "import" stages/pipelines from external places like Git repositories.

In this way, users will be able to create a "library" of reusable stages/pipelines and reuse them from different projects (through copy, Git submodules, or git clone https://my-dvc-repo).

UPDATE: This comment was extracted as a separate feature request #1472

@prihoda
Contributor

prihoda commented Jan 4, 2019

@dmpetrov For the inputs and outputs, I meant that they could also contain variables from the config file. But now I see that my suggestions were trying to solve a different, less challenging problem - reproducing an existing stage with changed parameters, inputs or outputs (where all of those could have multiple values - unzipping multiple files, evaluating multiple parameters).

So the point of reconfigurable stages is different than I thought - basically you want a named library of "stage templates" or "reusable stages" that define a specific command, right? And wrapping those in "reusable pipelines" that would define a whole pipeline. This would definitely be useful, but you would have to make sure you are not reinventing the wheel.

I see that there are two levels to pipeline (workflow) management:

  • Reproducible level providing a versioned "archive" of a single specific pipeline execution, including data (DVC, as you call experiment management software)
  • Reusable level providing a reusable pipeline definition, basically serving as a way to easily wrap a lot of depending subtasks into an executable "program" (often called workflow management software)

There are loads of workflow managers that operate on the reusable level, see https://github.com/pditommaso/awesome-pipeline,
e.g. Cromwell using the Workflow Definition Language: https://github.com/openwdl/wdl

You could imagine writing a workflow in these tools which would actually run each command using dvc run, providing a reusable pipeline that produces the data along with their DVC stages, all of which could be stored in Git.

I definitely see the benefit in reusable stages, just keep in mind you're entering a whole new world of existing solutions 😄

@dmpetrov
Member Author

dmpetrov commented Jan 4, 2019

@prihoda It looks like the problems of reconfigurable stages/pipelines and a library of stages/pipelines are related, and the solutions can complement each other. If we have a way to define a reconfigurable stage, why don't we provide an ability to extract this stage from a project and reuse it from a different project?

@prihoda
Contributor

prihoda commented Jan 5, 2019

Sure, a library could be useful. My point is only that reconfigurable/reusable pipelines are a world of their own, with many existing solutions.

If I understand it correctly, reconfigurable stages would basically just define a command and its inputs, outputs and other parameters. But isn't that already what any script can do? What would a reconfigurable stage bring as opposed to writing a bash script?

So the main contribution would be the reconfigurable pipelines, but again, it would have to provide some benefits over just writing a bash script that calls each step. The main problem is that you don't know the exact intermediate files that will be created when executing the pipeline, since they are based on parameters, e.g. an unknown number of input files or model hyperparameter values.

@prihoda
Contributor

prihoda commented Jan 5, 2019

For example, let's say you want to create a pipeline that chops a list of files into chunks of 100 lines, sorts the lines in each produced file and then merges the files into one final file. The stages are:

  • chunk NUM_LINES IN_FILE which turns MYFILE into MYFILE.0 MYFILE.100 MYFILE.200 etc.
  • sort IN_PATH OUT_PATH which sorts lines in the file alphanumerically
  • merge IN_PATH1 .. IN_PATHX OUT_PATH which merges multiple files into one file
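The three stages above can be sketched in plain bash (no DVC involved), assuming GNU coreutils: `split -l` stands in for chunk, `sort` for sort, and `sort -m` merges the already-sorted chunks:

```shell
# Sample input for the pipeline.
printf 'banana\napple\ncherry\ndate\n' > MYFILE

# chunk: MYFILE -> MYFILE.00 MYFILE.01 ... (2 lines per chunk for the demo)
split -l 2 -d MYFILE MYFILE.

# sort: one run per produced chunk
for f in MYFILE.0?; do
  sort "$f" -o "$f.sorted"
done

# merge: combine all sorted chunks into one final file
sort -m MYFILE.0?.sorted > final.txt
```

The point of the example is that the number of intermediate files (MYFILE.00, MYFILE.01, ...) is only known at run time, which is exactly what makes wiring such a pipeline into static stage files hard.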

I see two options to define a reconfigurable pipeline:

  1. The current "DVC way" as a graph, where stage files are edges. The pipeline is represented as a folder structure with pre-created DVC files. Each stage would need to input and output a dynamic list of files, which requires either a) for cycles in the command or b) a "DVC native" way to define a cycle inside a stage:
    a. Using for cycles in the command can be done with lists or wildcards, e.g. for file in input/*.txt; do ...; done. Supporting wildcards in input/output paths would remove the need to store a folder with the results of each intermediate stage.
    b. Using a "DVC native" stage cycle that runs the command for each parameter value. Input/output wildcards could also be used.
  2. Using some kind of a pipeline definition file, with a custom workflow definition language (like WDL, https://github.com/openwdl/wdl), where the stages "know about each other" and can access all parameters. This is basically like writing a Bash file, just with more structure.
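Option 1.a could look roughly like this in practice: one stage whose command is a for cycle over a dependency folder. The `dvc run` wrapper shown in the comment is illustrative only; below it, just the inner command runs so the data flow can be inspected:

```shell
# In DVC this would be wrapped as a single stage, roughly (not run here):
#   dvc run -d input -o output \
#     'for f in input/*.txt; do sort "$f" > "output/$(basename "$f")"; done'

# Create sample dependency files.
mkdir -p input output
printf 'b\na\n' > input/one.txt
printf 'd\nc\n' > input/two.txt

# The stage command: one sort per file in the dependency folder.
for f in input/*.txt; do
  sort "$f" > "output/$(basename "$f")"
done
```

Because the dependency and output are whole folders, the stage file never has to enumerate the individual files, which is what makes this variant work with a dynamic number of inputs.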

Are you thinking about solution 1 or 2? Or do you have something else in mind?

@efiop
Contributor

efiop commented Jan 6, 2019

Sorry for interrupting, guys, but it seems like my emails are getting lost somewhere. @prihoda I've sent you a few messages and didn't hear back at all - could you please contact me back? Thanks.

@prihoda
Contributor

prihoda commented Jan 7, 2019

@efiop Sorry I only check my email from my laptop, I was away from it for a few days. Sent a reply.

@dmpetrov
Member Author

dmpetrov commented Jan 7, 2019

@prihoda you are right - you can just rerun a stage with different inputs/outputs/params. However, to reconfigure a pipeline you have to redefine the whole pipeline each time you reuse it. See the Discord discussion with vern from 11/27/18: "it's annoying to write them all (stages) out by hand and then do it again for each color (parameter)."

Intermediate results should be cached and reused (step1 can be the same for two different "pipeline calls/instances") if we implement build-cache #1234.

Your example with variable output size is a separate question. The reconfiguration might support variable output size or might not (I see no reason not to support it). So, 1.a looks like the more reasonable solution. I don't want to make the stages "know about each other".

PS: I don't see any problem with "reusable pipelines are a world on their own". We have a pretty clear demand for reusable pipelines, and it was one of the DVC features that I initially planned to implement, but the data/cache part took much more time than I expected. If you have any concerns about this direction - I'd love to hear more.

@dmpetrov dmpetrov changed the title Reconfigurable stages Reconfigurable pipelines Jan 7, 2019
@prihoda
Contributor

prihoda commented Jan 7, 2019

@dmpetrov yeah, the "world on its own" would mostly be a problem if you were going for option 2. So if you are going with option 1.a, what are the new "reconfigurable" features that you have in mind? Providing a storage of pipelines, plus an ability to execute them with custom parameters (command parameters and input and output paths)? Or is it more about the build cache #1234?

@dmpetrov
Member Author

dmpetrov commented Jan 7, 2019

@prihoda I've just separated two issues: this one and #1472. And thank you for your comments - they made me clarify the issue and even rename it.

This issue is just about defining configs/input/params and how to instantiate a pipeline with a new set of params. The instantiation part might and probably should include #1234.

A store of pipelines is related to the new issue. I don't think any special store is needed. A module can simply be reused from Git repos or just copied.

@kaleidoescape

I believe that this issue may also address our use case (but please correct me if I'm wrong or if you have some nice idea for something else that already addresses it better).

Anyways, in our case, we have some large codebase that has various functions in it which perform different steps in our pipeline. We also have multiple customers that we create models for. We also create multiple models for each customer.

  1. We want to keep the customer data in repos that are separate from the big codebase, and also separate from each other.
  2. We would like to use different steps in our pipeline from our big codebase, depending on the customer, e.g. for customer A we want to use steps (a, b, c, d, e) and for customer B we would like to use steps (b, c, a, e, f). We want to save the output of each step to its own DVC file.
  3. In addition, ideally, we would create some kind of configuration, which gives our data file to the first step in our pipeline, and then passes the outputs of each step along the chain to the other pipeline steps (so that we don't always have to write filepaths everywhere in every single step).
  4. And our pipeline steps are of course not necessarily linear (but I think DVC already addresses this aspect quite well).

I think no matter what, we'd have to do quite a bit of custom work to make everything run smoothly, but I think maybe with reconfigurable DVC pipelines, it'd be a little easier.

@jorgeorpinel
Contributor

jorgeorpinel commented Jul 21, 2020

Hi! Resurrecting 🧟

Possible improvement 1: Support input and output wildcards that would allow solution B to work even with files in the same folder

From recurring user feedback on support channels, I also came up with this idea recently (see #4254). I think it's still needed (or at least desirable) even now that we also have parameters.

@efiop efiop closed this as completed May 3, 2021
@iterative iterative locked and limited conversation to collaborators May 3, 2021

