Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repo: Add repro from given out names functionality #5273

Closed
wants to merge 16 commits into from

Conversation

Vonski
Copy link

@Vonski Vonski commented Jan 14, 2021

I'm not sure if I should have added special test for this functionality.

Should the --glob functionality be preserved? If so, should handling for it be added to collect_granular(...)?

Fixes #3875

@pared

@pared pared assigned pared and unassigned pared Jan 15, 2021
@pared pared self-requested a review January 15, 2021 14:08
Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment,
In any case, test in test_repro veryfing that repro by output is possible would be nice.

accept_group=accept_group,
glob=glob,
)
granular_stages = self.stage.collect_granular(
Copy link
Contributor

@pared pared Jan 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if we are to change this method, we will effectively make glob stop working. We should not do that.
Ideally we should support glob for collect_granular.
But, we could also use collect_granular as a fallback in a situation when glob is not provided and collect returns no results. That way it will be easier but we will need to create issue for supporting glob in collect_granular and do it in the future anyway.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second approach is inconsistent, because it would not let user to do repro by outs with glob option. I'll try to implement glob handling logic in collect_granular. One more question. There wasn't any tests fails connected to missing glob functionality. Should I prepare a test that validates if globbing is possible?

This comment was marked as resolved.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, as far as I understand the only acceptable solution is to upgrade collect_granular to support globbing and accept_group. May I try to do it here?

This comment was marked as resolved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Vonski Sure, go ahead, if you stumble upon any problems, feel free to ping us.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now collect_granular handles glob also.

@Vonski
Copy link
Author

Vonski commented Jan 18, 2021

One comment,
In any case, test in test_repro veryfing that repro by output is possible would be nice.

I've added one, is this check correct for this functionality?

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jan 24, 2021
@efiop efiop changed the title repo: Add repro from given out names functionality [WIP] repo: Add repro from given out names functionality Jan 24, 2021
@Vonski Vonski changed the title [WIP] repo: Add repro from given out names functionality repo: Add repro from given out names functionality Feb 11, 2021
@pared pared added enhancement Enhances DVC ui user interface / interaction and removed awaiting response we are waiting for your reply, please respond! :) labels Feb 11, 2021
@pared pared self-requested a review February 14, 2021 08:21
@pared pared requested review from a team and removed request for a team February 14, 2021 08:21
@efiop efiop requested a review from skshetry March 3, 2021 13:28
@@ -410,6 +413,9 @@ def func(out):
if recursive and out.path_info.isin(path_info):
return True

if glob:
return fnmatch.fnmatch(out.path_info.name, path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name is a basename, we should be checking for the str(path_info) which should also make it work with ../data/** kind of globs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed that and added new test to check how globbing with directories in path works.

@@ -299,14 +303,16 @@ def collect_granular(
return [StageInfo(stage) for stage in self.repo.stages]

stages, file, _ = _collect_specific_target(
self, target, with_deps, recursive, accept_group
self, target, with_deps, recursive, accept_group, glob=glob
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that repro will support output name as a target, I don't think we should glob for the stage names at all. Or, we could think of a different flags to glob stage names.

Copy link
Member

@skshetry skshetry Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this way, if the glob does match within the dvc.yaml, it will try not to collect from other stages. Globbing outputs should be global.

@skshetry
Copy link
Member

skshetry commented Mar 4, 2021

cc'ing @dberenbaum @shcheklein

Right now, we have a --glob only for the stage names. I'd say that we change it to --glob-stages and use --glob for outputs-based glob. 🙂

@pared
Copy link
Contributor

pared commented Mar 5, 2021

@Vonski seems like we have a consensus. We need another glob flag. As @skshetry mentioned --glob-stages should be for globbing in context of stages, and --glob should be for outputs. Would you be willing to introduce this?

@Vonski
Copy link
Author

Vonski commented Mar 5, 2021

@Vonski seems like we have a consensus. We need another glob flag. As @skshetry mentioned --glob-stages should be for globbing in context of stages, and --glob should be for outputs. Would you be willing to introduce this?

@pared Yeah, sure. I assume that it should be introduced in this PR. Correct me if I'm wrong.

@pared
Copy link
Contributor

pared commented Mar 5, 2021

@Vonski yes, the change shall be included here

@skshetry
Copy link
Member

@Vonski, ping. What's the status of this?

Looking at this, it seems we are asking quite a big feature/change. If you feel the same, how about we start by supporting the repro from given output name (just the title of the PR)? We could introduce other changes in successive PRs. What do you think? 🙂

@Vonski
Copy link
Author

Vonski commented Mar 17, 2021

@skshetry Honestly, I don't know. I couldn't find time to think about it yet. It looks like today I'll have some, so in 12 hours from now I should be able to answer the question, or maybe even to push some implemented solution. What do you say?

@Vonski
Copy link
Author

Vonski commented Mar 17, 2021

@skshetry @pared I pushed a preview of my idea of mutually exclusive split of old glob flag, into new glob for outs paths, and glob_stages for stages names. Please check if the logic of reproduce is correct. Is this assumption about mutual exclusiveness acceptable?

As I understand, I'll also need to change some code for command repro to accept new --glob-stages flag. Is that correct? If so, I would appreciate if you could point out the files that should be modified to achieve that.

Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Vonski

Is this assumption about mutual exclusiveness acceptable

Yes, if there will be need to support that, we will continue with this change. For now is fine. We should probably check in the repro command and throw Exception mentioning that both cannot be provided.

As I understand, I'll also need to change some code for command repro to accept new --glob-stages flag.

Correct, here is what I think needs to be done:

@Vonski
Copy link
Author

Vonski commented Mar 24, 2021

@pared I hope to find some time for it before the end of the week, fyi.

@Vonski
Copy link
Author

Vonski commented Mar 28, 2021

@pared I added requested changes.

dvc/command/repro.py Outdated Show resolved Hide resolved
dvc/command/repro.py Outdated Show resolved Hide resolved
dvc/repo/stage.py Outdated Show resolved Hide resolved
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 7, 2021

Right now, we have a --glob only for the stage names. I'd say that we change it to --glob-stages and use --glob for outputs-based glob.

Q. Couldn't changing the behavior of --glob be a breaking change for some users? (Better leave for 3.0?)

p.s. I do like that it's more consistent with add/pull/push --glob now though.

Co-authored-by: Jorge Orpinel <jorgeorpinel@users.noreply.github.com>
Vonski and others added 3 commits April 9, 2021 15:32
Co-authored-by: Jorge Orpinel <jorgeorpinel@users.noreply.github.com>
Co-authored-by: Jorge Orpinel <jorgeorpinel@users.noreply.github.com>
dvc/command/repro.py Outdated Show resolved Hide resolved
@pared
Copy link
Contributor

pared commented Apr 16, 2021

Ok, it seems to me that we are stretching this too much, @jorgeorpinel it seems that besides some suggestions it's ok, right? @Vonski feel free to accept the suggestions and let's merge this change, docs can be finished later.

Co-authored-by: Jorge Orpinel <jorgeorpinel@users.noreply.github.com>
@pared
Copy link
Contributor

pared commented Apr 16, 2021

@jorgeorpinel

Couldn't changing the behavior of --glob be a breaking change for some users? (Better leave for 3.0?)

Yes, this is a breaking change, however, after it, we still would be able to retain "old" behavior with the new flag --glob-stages. I think that because of that we should proceed and involve this change in the current version.

@dberenbaum @shcheklein am I right here?

@dberenbaum
Copy link
Collaborator

@jorgeorpinel

Couldn't changing the behavior of --glob be a breaking change for some users? (Better leave for 3.0?)

Yes, this is a breaking change, however, after it, we still would be able to retain "old" behavior with the new flag --glob-stages. I think that because of that we should proceed and involve this change in the current version.

@dberenbaum @shcheklein am I right here?

Hm, if someone is scripting their use of dvc repro --glob (like in cml), then this seems like it's going to break their workflow, so I'm not sure it's a good idea to introduce without sufficient warning. Maybe it would be better to go with --glob/--glob-outs instead of --glob/--glob-stages?

By the way, forgetting about glob, what happens if there is a stage and an output in a different stage with the same name?

@pared
Copy link
Contributor

pared commented Apr 19, 2021

@dberenbaum

By the way, forgetting about glob, what happens if there is a stage and an output in a different stage with the same name?

In that case, both stages shall be reproduced.

@pared
Copy link
Contributor

pared commented Apr 19, 2021

@dberenbaum

I'm not sure it's a good idea to introduce without sufficient warning. Maybe it would be better to go with --glob/--glob-outs instead of --glob/--glob-stages?

It is possible that that we might break some users scripts. I see 2 solutions to this problem:

  1. accept the fact that something might fail
  2. use --glob-outs/--glob approach.

While I am not a fan of breaking the consistency between minor dvc versions, it seems to me that in this case it would be justified by the fact (noted by @jorgeorpinel) that add/pull/push commands target outputs when using --glob.

So if we go with --glob-outs that puts us in situation where one needs to use different option when using add/pull/push (--glob) and other one (--glob-outs) when using repro command. From UI point of view it would seem to me to be tiresome to remember that fact.

@dberenbaum
Copy link
Collaborator

Before 1.0, users always specified outputs as targets to repro, right?

By the way, forgetting about glob, what happens if there is a stage and an output in a different stage with the same name?

In that case, both stages shall be reproduced.

Hmm, so do we recommend users avoid this? How would someone repro one or the other? Is this the same behavior for push/pull?

While I am not a fan of breaking the consistency between minor dvc versions, it seems to me that in this case it would be justified by the fact (noted by @jorgeorpinel) that add/pull/push commands target outputs when using --glob.

Makes sense, but those commands are inherently about files and repro is inherently about stages. Do push/pull glob stages?

@pared
Copy link
Contributor

pared commented Apr 20, 2021

Hmm, so do we recommend users avoid this? How would someone repro one or the other? Is this the same behavior for push/pull?

We don't do that in any way - glob option is there to switch this on or off. If one wants to reproduce only particular output, dvc repro should be called for particular stage. Using glob is a convenience method allowing user to target multiple stages containing assets matching the target. In case of push/ pull its similar - one should not use glob if aiming to interact with some particular target. If users would still like to use glob to repro only particular output, full path needs to be provided.

Makes sense, but those commands are inherently about files and repro is inherently about stages. Do push/pull glob stages?

No, it does not.

@dberenbaum
Copy link
Collaborator

Thanks @pared, makes sense!

I'm still unsure about introducing a breaking change in a minor release like this. Would be interested in getting thoughts from others @jorgeorpinel @efiop. Is changing from --glob to glob for stages to --glob/--glob-stages to glob for files/stages enough of a breaking change to warrant holding off until a major release?

@efiop
Copy link
Contributor

efiop commented Apr 21, 2021

Btw, --glob is like that right now because we want to turn it on by default once we are ready. And we won't be able to turn both --glob and --glob-stages on without mixing them together. And that's actually probably what we really want here in the long term: to make --glob work for both stages and outputs and even dvc-files - that would be the best ui, really. So right now it would be nice to merge --glob and --glob-stages into one --glob, if that makes sense.

@skshetry considering the implications that we already have in collect logic for outputs/stages/granular outputs, what do you think about ^^^ ?

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 21, 2021

better to go with --glob/--glob-outs ? ... add/pull/push commands target outputs when using --glob.

I'd also err on the side of not breaking anyone's scripts for now.

Also keep in mind repro may be phased out into 3.0? In favor of exp run. Will this new targeting (+ globing) make sense for exp run? (Do you "run" a data artifact)?

if there is a stage and an output (in a different stage) with the same name ... In that case, both stages shall be reproduced

Would be ideal to error-out instead with an appropriate message, and a way to specify whether you meant the stage or the output. But prob not critical so could be left for a follow-up PR. And again will this all pass on from repro to exp run? 🙂

would be nice to merge --glob and --glob-stages into one --glob

Agree. Another route to consider: leave --glob changes out of this PR if possible (wouldn't support file/dir paths for now in repro).

@skshetry
Copy link
Member

And we won't be able to turn both --glob and --glob-stages on without mixing them together.

It will be kinda confusing mixing outputs and stages both for glob. I have been against the change for supporting output names as a target in repro, but with glob too in the mix, it adds more confusion.

Internally, it complicates things as well, as it's already quite hard to support multiple targets, optimizing for stages names and directories and provide good error message as well. So I'd prefer avoiding it as much as possible.

even dvc-files

Why would we need dvc-files here?

@dberenbaum, @jorgeorpinel, @efiop, as mentioned by @pared, I don't think the glob discussion is worth it for blocking this PR, especially considering the PR is sitting there for more than 3 months now. Let's discuss this in a separate issue.

@efiop
Copy link
Contributor

efiop commented Apr 22, 2021

Thanks for a great discussion, guys! 🙏

We had a great discussion with @skshetry about this today and so, to summarize: we can't merge this as is, as it is breaking backward compatibility and creates a problem for us in the near future when we will enable --glob by default. The latter problem applies to alternative --glob-outs approach, so it is also a nogo. That being said, I see that @skshetry has drawn a pretty nice plan in #5273 (comment) , where he raised a concern that this PR is doing too much, and I tend to agree with him. In order to support outs globbing we need to first support using outputs in a separate PR (along with figuring out a way to distinguish output name from a stage name in a deterministic way, which is not trivial) and then we can come back to the globbing itself that should be part of regular --glob (there are lots of questions about distinguishing stages/outputs there too, so it will be quite a substantial effort to get it right).

@Vonski Thank you so much for your contributions! 🙏 I hope you are not feeling discouraged by this, it is just a classic issue that turned out to be much more complex than originally anticipated, and all of these discussions and research are arguably an even bigger contribution to the project than the PR itself, so we truly appreciate it.

So there are three paths we could take from here:

  1. close this PR and start from scratch, tackling output support first, merging it and then proceeding with globbing
  2. repurpose this PR for only tackling outputs (without globbing) and only then proceeding with globbing in a separate PR
  3. considering that the original issue doesn't seem to be in high demand (it is a bit surprising, but still), we might just decide that less is more and not implement it for now, until there is a clear demand from users

@Vonski What do think?

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Apr 23, 2021
@efiop
Copy link
Contributor

efiop commented Apr 26, 2021

Closing for now.

@efiop efiop closed this Apr 26, 2021
@Vonski
Copy link
Author

Vonski commented Apr 26, 2021

Ah sorry I'm late. I would appreciate to start from scratch and try to push issue forward anyway. I think 1. option is the best idea under the circumstances. I am also a little more engaged in some other things right now, so I have less time to make things happen here, but if it is not a problem, I want to help with that. I hope I'll find some time this week to create another PR for this.

@efiop
Copy link
Contributor

efiop commented Apr 26, 2021

@Vonski Sounds good! Thank you! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting response we are waiting for your reply, please respond! :) enhancement Enhances DVC ui user interface / interaction
Projects
None yet
Development

Successfully merging this pull request may close these issues.

repro: support running stage via it's output's name
7 participants