This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Reconfigurable pipelines #1462

Closed
dmpetrov opened this issue Dec 31, 2018 · 14 comments
Labels
feature request Requesting a new feature

Comments

@dmpetrov
Member

dmpetrov commented Dec 31, 2018

Many issues require reconfiguration of stages and even pipelines:

A concept of reconfigurable-stage should be introduced in DVC.

Open questions:

  1. how to pass config file (do we need multiple config files)?
  2. how to pass params (should we override config by the params or it is a separate concept)?
  3. how to specify input files?
  4. how to specify the output directory (gs1/)?
  5. how to specify an output file without an output directory (see ./output.p instead of gs1/output.p from the above)?
  6. how to make it work for pipelines?
  7. Should build cache (#1234, "repro: use build cache for deterministic stages") be part of the solution? It would allow caching intermediate results of reusable stages (if step1 is the same in a few pipeline instances).

UPDATE: #1214 might be also related to this issue.
UPDATE2: Add a quote from vern and open question 7.

@dmpetrov dmpetrov added the feature request Requesting a new feature label Dec 31, 2018
@prihoda
Contributor

prihoda commented Jan 2, 2019

Sounds great, I guess we should think about all the use cases and what the new functionality will bring compared to the existing solutions.

  1. Running an experiment that reproduces the whole pipeline with new parameters (probably in a separate branch).
    • Current solution A: Manually edit the parameters in each stage file and reproduce the pipeline.
    • Current solution B: Use source config.sh before each parametrized command. Input and output paths cannot change.
    • Possible improvement: A global DVC config file loaded automatically before each command (#1416, "pipelines: parametrize using environment variables / DVC properties"). Parameters could also be incorporated into input and output paths.
  2. Keeping track of multiple outputs generated by the same command and different parameters (in the same branch).
    • Current solution A: Duplicate stages with the same command, just different parameters. The stages can be created manually or using a for cycle (for i in {1..10}; do dvc run cmd --option $i; done). Input and output wildcards can be used, although only when first running the command - they will be hardcoded in the stage file after that.
    • Current solution B: Single stage containing a for cycle. Input and output wildcards can be used, and if they are contained inside a folder, they won't have to be hardcoded, since the input/output dependency will only specify the folder (useful together with #1214, "run/repro: add option to not remove outputs before reproduction", to solve #1119, "How to manage repetitive dvc run commands (like unpacking of many zip files)?").
    • Possible improvement 1: Support input and output wildcards that would allow solution B to work even with files in the same folder. The stage command could be run once for all inputs (unzip *.zip - possible even when multiple input/output paths contain wildcards) or separately for each file which would be provided as a variable (unzip $INPUT - only possible if only one input/output path contains wildcards). Providing a corresponding output path would probably have to be handled by the user (png2jpg $INPUT ${INPUT/.png/.jpg}).
      • Would have to use quotes when executing: dvc run -d 'input*.zip' -o 'output*.zip' 'unzip *.zip'. Might be solved using a prompt (#1415, "Get run command from stdout").
      • Might be tricky to implement, but should be possible by taking all the inputs/outputs matching the wildcard and handling them as an imaginary folder (which can be empty or contain any number of files). Just checking output path duplicates would be harder, since you would have to make sure no path can match both wildcards.
    • Possible improvement 2: Support specifying parameters as part of the stage file to avoid for cycles (#1018, "dvc: consider introducing build matrix").
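The per-file variable pattern from "possible improvement 1" can be sketched in plain bash, independent of DVC. `png2jpg` is hypothetical, so the loop only echoes the conversion each iteration would perform:

```shell
# Create a couple of dummy inputs to drive the wildcard.
mkdir -p wildcard_demo
touch wildcard_demo/a.png wildcard_demo/b.png

for INPUT in wildcard_demo/*.png; do
  # ${INPUT/.png/.jpg} derives the output path from the input path,
  # as in the png2jpg example above.
  OUTPUT="${INPUT/.png/.jpg}"
  echo "png2jpg $INPUT $OUTPUT"
done
```

This is exactly what a "DVC native" per-file wildcard would have to automate: expand the glob, bind each match to a variable, and let the user derive the corresponding output path.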

@dmpetrov
Member Author

dmpetrov commented Jan 4, 2019

@prihoda I agree with both your statements, but I need to clarify a couple of things.

First, I didn't get this part: "Parameters could also be incorporated into input and output paths." Does it mean that you would like to change inputs and outputs via parameters? To my mind, inputs, outputs, and parameters are different concepts, and DVC should provide an ability to change them separately, like

dvc run-reconf my_reconf_stage --input raw/input.csv --output output.p --config myconf1.yaml --param alpha=0.57 --param steps=400

What do you think about this?

Second, "Possible improvement 2" looks better to me, since our custom wildcards will have more limitations (compared to user scripts/loops) and it might be tricky to implement, as you said. Does the code above (dvc run-reconf my_reconf_stage ...) look like your "Possible improvement 2"?

PS: I think I need to clarify the "#1119 repetitive commands" part. It is not about loops (loops are fine and probably even better than custom wildcards); it is about running a reconfigurable stage (let's imagine unzip is a reconf stage) where we would like to have the output of this stage in any place of our project (not only in a specific stage directory).

@dmpetrov
Member Author

dmpetrov commented Jan 4, 2019

I'd add one more use case to the requirement list that might be extremely useful.

  • we should provide an ability to "import" stages/pipelines from external places like Git repositories.

In this way, users will be able to create a "library" of reusable stages/pipelines and reuse them from different projects (through copy, Git submodules, or git clone https://my-dvc-repo).

UPDATE: This comment was extracted as a separate feature request #1472

@prihoda
Contributor

prihoda commented Jan 4, 2019

@dmpetrov For the inputs and outputs, I meant that they could also contain variables from the config file. But now I see that my suggestions were trying to solve a different, less challenging problem - reproducing an existing stage with changed parameters, inputs or outputs (where all of those could have multiple values - unzipping multiple files, evaluating multiple parameters).

So the point of reconfigurable stages is different than I thought - basically you want a named library of "stage templates" or "reusable stages" that define a specific command, right? And wrapping those in "reusable pipelines" that would define a whole pipeline. This would definitely be useful, but you would have to make sure you are not reinventing the wheel.

I see that there are two levels to pipeline (workflow) management:

  • Reproducible level providing a versioned "archive" of a single specific pipeline execution, including data (DVC, as you call experiment management software)
  • Reusable level providing a reusable pipeline definition, basically serving as a way to easily wrap a lot of depending subtasks into an executable "program" (often called workflow management software)

There are loads of workflow managers that operate on the reusable level, see https://github.com/pditommaso/awesome-pipeline,
e.g. Cromwell using the Workflow Definition Language: https://github.com/openwdl/wdl

You could imagine writing a workflow in these tools which would actually run each command using dvc run, providing a reusable pipeline that produces the data along with their DVC stages, all of which could be stored in Git.

I definitely see the benefit in reusable stages, just keep in mind you're entering a whole new world of existing solutions 😄

@dmpetrov
Member Author

dmpetrov commented Jan 4, 2019

@prihoda It looks like the problems of reconfigurable stages/pipelines and a library of stages/pipelines are related, and the solutions can complement each other. If we have a way to define a reconfigurable stage, why don't we provide an ability to extract this stage from a project and reuse it from a different project?

@prihoda
Contributor

prihoda commented Jan 5, 2019

Sure, a library could be useful. My point is only that reconfigurable/reusable pipelines are a world of their own, with many existing solutions.

If I understand it correctly, reconfigurable stages would basically just define a command and its inputs, outputs and other parameters. But isn't that already what any script can do? What would a reconfigurable stage bring as opposed to writing a bash script?

So the main contribution would be the reconfigurable pipelines, but again, it would have to provide some benefits over just writing a bash script that calls each step. The main problem is that you don't know the exact intermediate files that will be created when executing the pipeline, since they are based on parameters, e.g. an unknown number of input files or model hyperparameter values.

@prihoda
Contributor

prihoda commented Jan 5, 2019

For example, let's say you want to create a pipeline that chops a list of files into chunks of 100 lines, sorts the lines in each produced file and then merges the files into one final file. The stages are:

  • chunk NUM_LINES IN_FILE which turns MYFILE into MYFILE.0 MYFILE.100 MYFILE.200 etc.
  • sort IN_PATH OUT_PATH which sorts lines in the file alphanumerically
  • merge IN_PATH1 .. IN_PATHX OUT_PATH which merges multiple files into one file
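The three stages above can be sketched in plain bash (no DVC involved), assuming GNU coreutils: `split -l` stands in for chunk, `sort` for sort, and `sort -m` merges the already-sorted chunks:

```shell
# Sample input for the pipeline.
printf 'banana\napple\ncherry\ndate\n' > MYFILE

# chunk: MYFILE -> MYFILE.00 MYFILE.01 ... (2 lines per chunk for the demo)
split -l 2 -d MYFILE MYFILE.

# sort: one run per produced chunk
for f in MYFILE.0?; do
  sort "$f" -o "$f.sorted"
done

# merge: combine all sorted chunks into one final file
sort -m MYFILE.0?.sorted > final.txt
```

The point of the example is that the number of intermediate files (MYFILE.00, MYFILE.01, ...) is only known at run time, which is exactly what makes wiring such a pipeline into static stage files hard.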

I see two options to define a reconfigurable pipeline:

  1. The current "DVC way" as a graph, where stage files are edges. The pipeline is represented as a folder structure with pre-created DVC files. Each stage would need to input and output a dynamic list of files, which requires either a) for cycles in the command or b) a "DVC native" way to define a cycle inside a stage:
    a. Using for cycles in the command can be done with lists or wildcards, e.g. for file in input/*.txt; do ...; done. Supporting wildcards in input/output paths would remove the need to store a folder with the results of each intermediate stage.
    b. Using a "DVC native" stage cycle that runs the command for each parameter value. Input/output wildcards could also be used.
  2. Using some kind of a pipeline definition file, with a custom workflow definition language (like WDL, https://github.com/openwdl/wdl), where the stages "know about each other" and can access all parameters. This is basically like writing a Bash file, just with more structure.
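Option 1.a could look roughly like this in practice: one stage whose command is a for cycle over a dependency folder. The `dvc run` wrapper shown in the comment is illustrative only; below it, just the inner command runs so the data flow can be inspected:

```shell
# In DVC this would be wrapped as a single stage, roughly (not run here):
#   dvc run -d input -o output \
#     'for f in input/*.txt; do sort "$f" > "output/$(basename "$f")"; done'

# Create sample dependency files.
mkdir -p input output
printf 'b\na\n' > input/one.txt
printf 'd\nc\n' > input/two.txt

# The stage command: one sort per file in the dependency folder.
for f in input/*.txt; do
  sort "$f" > "output/$(basename "$f")"
done
```

Because the dependency and output are whole folders, the stage file never has to enumerate the individual files, which is what makes this variant work with a dynamic number of inputs.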

Are you thinking about solution 1 or 2? Or do you have something else in mind?

@efiop
Contributor

efiop commented Jan 6, 2019

Sorry for interrupting, guys, but it seems like my emails are getting lost somewhere. @prihoda I've sent you a few messages and didn't hear back at all - could you please contact me back? Thanks.

@prihoda
Contributor

prihoda commented Jan 7, 2019

@efiop Sorry I only check my email from my laptop, I was away from it for a few days. Sent a reply.

@dmpetrov
Member Author

dmpetrov commented Jan 7, 2019

@prihoda you are right - you can just rerun a stage with different inputs/outputs/params. However, to reconfigure a pipeline you have to redefine the whole pipeline each time you reuse it. See the Discord discussion with vern from 11/27/18: "it's annoying to write them all (stages) out by hand and then do it again for each color (parameter)."

Intermediate results should be cached and reused (step1 can be the same for two different "pipeline calls/instances") if we implement build-cache #1234.

Your example with variable output size is a separate question. The reconfiguration might support variable output size or might not (I see no reason not to support it). So, 1.a looks like the more reasonable solution. I don't want to make the stages "know about each other".

PS: I don't see any problem with "reusable pipelines are a world on their own". We have a pretty clear demand for reusable pipelines, and it was one of the DVC features that I initially planned to implement, but the data/cache part took much more time than I expected. If you have any concerns about this direction - I'd love to hear more.

@dmpetrov dmpetrov changed the title Reconfigurable stages Reconfigurable pipelines Jan 7, 2019
@prihoda
Contributor

prihoda commented Jan 7, 2019

@dmpetrov yeah, the "world on its own" would mostly be a problem if you were going for option 2. So if you are going with option 1.a, what are the new "reconfigurable" features that you have in mind? Providing a storage of pipelines, plus an ability to execute them with custom parameters (command parameters and input and output paths)? Or is it more about the build cache #1234?

@dmpetrov
Member Author

dmpetrov commented Jan 7, 2019

@prihoda I've just separated two issues: this one and #1472. And thank you for your comments - they made me clarify the issue and even rename it.

This issue is just about defining configs/input/params and how to instantiate a pipeline with a new set of params. The instantiation part might and probably should include #1234.

A store of pipelines is related to the new issue. I don't think any special store is needed. A module can simply be reused from Git repos or just copied.

@kaleidoescape

I believe that this issue may also address our use case (but please correct me if I'm wrong or if you have some nice idea for something else that already addresses it better).

Anyways, in our case, we have some large codebase that has various functions in it which perform different steps in our pipeline. We also have multiple customers that we create models for. We also create multiple models for each customer.

  1. We want to keep the customer data in repos that are separate from the big codebase, and also separate from each other.
  2. We would like to use different steps in our pipeline from our big codebase, depending on the customer, e.g. for customer A we want to use steps (a, b, c, d, e) and for customer B we would like to use steps (b, c, a, e, f). We want to save the output of each step to its own DVC file.
  3. In addition, ideally, we would create some kind of configuration, which gives our data file to the first step in our pipeline, and then passes the outputs of each step along the chain to the other pipeline steps (so that we don't always have to write filepaths everywhere in every single step).
  4. And our pipeline steps are of course not necessarily linear (but I think DVC already addresses this aspect quite well).

I think no matter what, we'd have to do quite a bit of custom work to make everything run smoothly, but I think maybe with reconfigurable DVC pipelines, it'd be a little easier.

@jorgeorpinel
Contributor

jorgeorpinel commented Jul 21, 2020

Hi! Resurrecting 🧟

Possible improvement 1: Support input and output wildcards that would allow solution B to work even with files in the same folder

From recurring user feedback on support channels, I also came up with this idea recently (see #4254). I think it's still needed (or at least desirable) even now that we also have parameters.

@efiop efiop closed this as completed May 3, 2021
@iterative iterative locked and limited conversation to collaborators May 3, 2021

