This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Reconfigurable pipelines #1462
Comments
Sounds great, I guess we should think about all the use cases and what the new functionality will bring compared to the existing solutions.
@prihoda I agree with both your statements, but I need to clarify a couple of things. First, I didn't get this part: "Parameters could also be incorporated into input and output paths." Does it mean that you would like to change inputs and outputs through parameters? To my mind, inputs, outputs and parameters are different concepts, and DVC should provide an ability to change them separately, like
What do you think about this? Second, "Possible improvement 2" looks better to me, since our custom wildcards will have more limitations (compared to users' scripts\loops) and might be tricky to implement, as you said. Does the code from above PS: I think I need to clarify the "#1119 repetitive commands" part. It is not about loops (loops are fine and probably even better than custom wildcards); it is about running a reconfigurable stage (let's imagine
I'd add one more use case to the requirement list that might be extremely useful.
In this way, users will be able to create a "library" of reusable stages\pipelines and reuse them from different projects (through copy, Git submodules or UPDATE: This comment was extracted as a separate feature request #1472
@dmpetrov For the inputs and outputs, I meant that they could also contain variables from the config file. But now I see that my suggestions were trying to solve a different, less challenging problem - reproducing an existing stage with changed parameters, inputs or outputs (where all of those could have multiple values - unzipping multiple files, evaluating multiple parameters). So the point of reconfigurable stages is different than I thought - basically you want a named library of "stage templates" or "reusable stages" that define a specific command, right? And wrapping those in "reusable pipelines" that would define a whole pipeline. This would definitely be useful, but you would have to make sure you are not reinventing the wheel. I see that there are two levels to pipeline (workflow) management:
There are loads of workflow managers that operate on the reusable level, see https://github.com/pditommaso/awesome-pipeline. You could imagine writing a workflow in these tools which would actually run each command using I definitely see the benefit in reusable stages, just keep in mind you're entering a whole new world of existing solutions 😄
@prihoda It looks like the problems of reconfigurable stages\pipelines and a library of stages\pipelines are related, and the solutions can complement each other. If we have a way to define a reconfigurable stage, why don't we provide an ability to extract this stage from a project and reuse it from a different project?
Sure, a library could be useful. My point is only that reconfigurable/reusable pipelines are a world of their own, with many existing solutions. If I understand it correctly, reconfigurable stages would basically just define a command and its inputs, outputs and other parameters. But isn't that already what any script can do? What would a reconfigurable stage bring as opposed to writing a bash script? So the main contribution would be the reconfigurable pipelines, but again, they would have to provide some benefit over just writing a bash script that calls each step. The main problem is that you don't know the exact intermediate files that will be created when executing the pipeline, since they depend on parameters, e.g. an unknown number of input files or model hyperparameter values.
For example, let's say you want to create a pipeline that chops a list of files into chunks of 100 lines, sorts the lines in each produced file and then merges the files into one final file. The stages are:
I see two options to define a reconfigurable pipeline:
Are you thinking about solution 1 or 2? Or do you have something else in mind?
Sorry for interrupting, guys, but it seems like my emails are getting lost somewhere. @prihoda I've sent you a few messages and didn't hear back at all, could you please get back to me? Thanks.
@efiop Sorry, I only check my email from my laptop and I was away from it for a few days. Sent a reply.
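To make the chop/sort/merge example concrete, here is a minimal Python sketch of those three steps. It is not DVC code; the function names (`chop`, `sort_chunk`, `merge`, `run_pipeline`) and the file naming scheme are assumptions made up for illustration. The point it shows is the one raised above: the list of intermediate files is only known after `chop` has run, because it depends on the number and length of the inputs.

```python
import heapq
from pathlib import Path


def chop(src: Path, out_dir: Path, chunk_lines: int = 100) -> list[Path]:
    """Split src into files of at most chunk_lines lines and return their paths."""
    lines = src.read_text().splitlines()
    chunks = []
    for i in range(0, len(lines), chunk_lines):
        # Hypothetical naming scheme: <input name>.chunk0000, .chunk0001, ...
        chunk = out_dir / f"{src.name}.chunk{i // chunk_lines:04d}"
        chunk.write_text("".join(f"{line}\n" for line in lines[i:i + chunk_lines]))
        chunks.append(chunk)
    return chunks


def sort_chunk(chunk: Path) -> Path:
    """Write a sorted copy of chunk next to it and return the new path."""
    out = chunk.with_name(chunk.name + ".sorted")
    out.write_text("".join(f"{line}\n" for line in sorted(chunk.read_text().splitlines())))
    return out


def merge(sorted_chunks: list[Path], dest: Path) -> Path:
    """Merge already-sorted chunk files into one sorted file."""
    merged = heapq.merge(*(p.read_text().splitlines() for p in sorted_chunks))
    dest.write_text("".join(f"{line}\n" for line in merged))
    return dest


def run_pipeline(inputs: list[Path], work_dir: Path, dest: Path,
                 chunk_lines: int = 100) -> list[Path]:
    """Run chop -> sort -> merge and return the intermediate files that were created."""
    intermediates = []
    for src in inputs:
        for chunk in chop(src, work_dir, chunk_lines):
            intermediates.append(sort_chunk(chunk))
    merge(intermediates, dest)
    return intermediates
```

Whichever option is chosen, a pipeline definition would have to cope with the fact that the return value of `run_pipeline` cannot be listed up front.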
@prihoda you are right: you can just rerun a stage with different inputs\outputs\params. However, to reconfigure a pipeline you have to redefine the whole pipeline each time you reuse it. See the Discord discussion with vern from 11/27/18: "it's annoying to write them all (stages) out by hand and then do it again for each color (parameter)." Intermediate results should be cached and reused (step1 can be the same for two different "pipeline calls\instances") if we implement build-cache #1234. Your example with variable output size is a separate question. The reconfiguration might or might not support variable output size (I see no reason not to support it). So, 1.a looks like a more reasonable solution. I don't want to make each stage "know about each other". PS: I don't see any problem with "reusable pipelines are a world on their own". We have a pretty clear demand for reusable pipelines, and it was one of the DVC features that I initially planned to implement, but the data\cache part took much more time than I expected. If you have any concerns with this direction, I'd love to hear more.
@dmpetrov yeah, the "world on its own" would mostly be a problem if you were going for option 2. So if you are going with option 1.a, what are the new "reconfigurable" features that you have in mind? Providing a storage of pipelines, plus an ability to execute them with custom parameters (command parameters and input and output paths)? Or is it more about the build cache #1234?
@pared I've just separated two issues: this one and #1472. And thank you for your comments, they made me clarify the issue and even rename it. This issue is just about defining configs/input/params and how to instantiate a pipeline with a new set of params. The instantiation part might, and probably should, include #1234. A store of pipelines is related to the new issue. I don't think any special store is needed. A module can simply be reused from Git repos or just copied.
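A rough Python sketch of what "instantiate a pipeline with a new set of params and reuse cached steps" could look like. Everything here is hypothetical: `StageTemplate`, `stage_key` and `instantiate_pipeline` are illustrative names, not DVC APIs, and the cache key hashes only the rendered stage definition, whereas a real build cache (#1234) would presumably also hash the contents of the dependencies.

```python
import hashlib
import json
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class StageTemplate:
    """A stage whose command and paths may contain ${placeholders}."""
    name: str
    cmd: str
    deps: tuple[str, ...] = ()
    outs: tuple[str, ...] = ()

    def instantiate(self, params: dict[str, str]) -> dict:
        """Fill in the placeholders; raises KeyError if a parameter is missing."""
        def fill(text: str) -> str:
            return Template(text).substitute(params)

        return {
            "name": self.name,
            "cmd": fill(self.cmd),
            "deps": [fill(d) for d in self.deps],
            "outs": [fill(o) for o in self.outs],
        }


def stage_key(stage: dict) -> str:
    """Assumed cache key: a hash of the fully rendered stage definition."""
    return hashlib.sha256(json.dumps(stage, sort_keys=True).encode()).hexdigest()


def instantiate_pipeline(templates: list[StageTemplate], params: dict[str, str],
                         cache: dict[str, dict]) -> list[dict]:
    """Render every stage and return only those not already present in the cache."""
    to_run = []
    for template in templates:
        stage = template.instantiate(params)
        key = stage_key(stage)
        if key not in cache:
            cache[key] = stage
            to_run.append(stage)
    return to_run
```

With a two-stage pipeline where only the second stage mentions `${color}`, calling `instantiate_pipeline` first with `{"color": "red"}` and then with `{"color": "blue"}` against the same `cache` returns both stages the first time and only the second stage the next time, since the rendered step1 is identical across the two instances.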
I believe that this issue may also address our use case (but please correct me if I'm wrong or if you have some nice idea for something else that already addresses it better). Anyways, in our case, we have some large codebase that has various functions in it which perform different steps in our pipeline. We also have multiple customers that we create models for. We also create multiple models for each customer.
I think no matter what, we'd have to do quite a bit of custom work to make everything run smoothly, but I think maybe with reconfigurable DVC pipelines, it'd be a little easier.
Hi! Resurrecting 🧟
From recurrent feedback from users on support channels, I also came up with this idea recently (see #4254). I think it's still needed (or desirable) even now that we also have parameters.
Many issues require reconfiguration of stages and even pipelines:
dvc run handle files with same name but different path #973 @Hong-Xiang was asking about reusing (reconfigurable) pipelines.
dvc run commands (like unpacking of many zip files)? #1119 repetitive commands. I see a similarity with parametrizable commands where only a single output is in use and without creating a separate directory for each experiment (./output.p instead of gs1/output.p).
A concept of reconfigurable-stage should be introduced in DVC.
Open questions:
gs1/)?
./output.p instead of gs1/output.p from the above)?
UPDATE: #1214 might also be related to this issue.
UPDATE2: Add a quote from vern and open question 7.