repo: Add repro from given out names functionality #5273
One comment: in any case, a test in `test_repro` verifying that repro by output is possible would be nice.
accept_group=accept_group,
glob=glob,
)
granular_stages = self.stage.collect_granular(
Hmm, if we change this method, we will effectively make `glob` stop working. We should not do that.
Ideally we should support `glob` in `collect_granular`.
But we could also use `collect_granular` as a fallback when `glob` is not provided and `collect` returns no results. That would be easier, but we would need to create an issue for supporting `glob` in `collect_granular` and do it in the future anyway.
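The fallback idea described above could look roughly like the sketch below. It is only an illustration: `collect` and `collect_granular` are passed in as plain callables standing in for the real DVC methods, and `collect_with_fallback` is a hypothetical helper name, not part of DVC.

```python
def collect_with_fallback(collect, collect_granular, target, glob=False):
    """Try the regular collection first; fall back to granular collection.

    `collect` and `collect_granular` are stand-ins (assumptions) for the
    real methods; each takes a target and returns an iterable of stages.
    """
    stages = list(collect(target, glob=glob))
    # Keep the regular result when it found something, or when the user
    # asked for globbing (the granular path does not support glob here).
    if stages or glob:
        return stages
    # No glob requested and nothing found: treat the target as an output.
    return list(collect_granular(target))
```

In this sketch the fallback never runs when `glob=True`, so a globbed output target would still return nothing.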
The second approach is inconsistent, because it would not let the user repro by outs with the `glob` option. I'll try to implement glob handling logic in `collect_granular`. One more question: there weren't any test failures connected to the missing glob functionality. Should I prepare a test that validates that globbing is possible?
OK, as far as I understand, the only acceptable solution is to upgrade `collect_granular` to support globbing and `accept_group`. May I try to do it here?
@Vonski Sure, go ahead, if you stumble upon any problems, feel free to ping us.
Now `collect_granular` handles `glob` as well.
I've added one; is this check correct for this functionality?
dvc/repo/__init__.py
@@ -410,6 +413,9 @@ def func(out):
    if recursive and out.path_info.isin(path_info):
        return True

    if glob:
        return fnmatch.fnmatch(out.path_info.name, path)
`name` is a basename; we should be checking against `str(path_info)`, which should also make it work with `../data/**` kinds of globs.
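A small standalone sketch of the difference, using only the standard library. `PurePosixPath` is used here as a stand-in for DVC's `path_info` object (an assumption for illustration), and both helper names are hypothetical.

```python
import fnmatch
from pathlib import PurePosixPath


def matches_basename(path, pattern):
    # Compares only the final path component, so any pattern that
    # contains a directory part can never match.
    return fnmatch.fnmatch(PurePosixPath(path).name, pattern)


def matches_full_path(path, pattern):
    # Compares the whole path string; fnmatch's "*" also matches "/",
    # so directory-style patterns work.
    return fnmatch.fnmatch(str(PurePosixPath(path)), pattern)


out = "../data/raw/train.csv"
print(matches_basename(out, "../data/**"))   # basename is "train.csv"
print(matches_full_path(out, "../data/**"))  # full path is compared
```

The basename check still works for patterns such as `*.csv`, which is why a test that only uses flat file names would not catch the problem.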
I changed that and added a new test to check how globbing with directories in the path works.
dvc/repo/stage.py
@@ -299,14 +303,16 @@ def collect_granular(
     return [StageInfo(stage) for stage in self.repo.stages]

 stages, file, _ = _collect_specific_target(
-    self, target, with_deps, recursive, accept_group
+    self, target, with_deps, recursive, accept_group, glob=glob
Now that `repro` will support output names as targets, I don't think we should glob for stage names at all. Or we could think of a different flag to glob stage names.
Done this way, if the glob matches within the `dvc.yaml`, it will not try to collect from other stages. Globbing outputs should be global.
cc'ing @dberenbaum @shcheklein Right now, we have a
@pared Yeah, sure. I assume that it should be introduced in this PR. Correct me if I'm wrong.
@Vonski Yes, the change should be included here.
@Vonski, ping. What's the status of this? Looking at this, it seems we are asking for quite a big feature/change. If you feel the same, how about we start by supporting the
@skshetry Honestly, I don't know. I haven't found time to think about it yet. It looks like I'll have some today, so 12 hours from now I should be able to answer the question, or maybe even push an implemented solution. What do you say?
@skshetry @pared I pushed a preview of my idea of mutually exclusive split of old As I understand, I'll also need to change some code for command repro to accept new
> Is this assumption about mutual exclusiveness acceptable

Yes. If there is a need to support that later, we will build on this change. For now it is fine. We should probably check in the `repro` command and throw an exception mentioning that both cannot be provided.

> As I understand, I'll also need to change some code for command repro to accept new --glob-stages flag.

Correct, here is what I think needs to be done:
- add `--glob-stages` to the repro command (take a look at `--glob` to see how to do that)
- parse `--glob-stages` in `_repro_kwargs`
- fix the test for parsing the command
@pared I hope to find some time for it before the end of the week, FYI.
@pared I added the requested changes.
Q. Couldn't changing the behavior of p.s. I do like that it's more consistent with
Co-authored-by: Jorge Orpinel <jorgeorpinel@users.noreply.github.com>
OK, it seems to me that we are stretching this too much. @jorgeorpinel, it seems that apart from some suggestions it's OK, right? @Vonski, feel free to accept the suggestions and let's merge this change; the docs can be finished later.
Yes, this is a breaking change; however, after it we would still be able to retain the "old" behavior with the new flag. @dberenbaum @shcheklein, am I right here?
Hm, if someone is scripting their use of By the way, forgetting about glob, what happens if there is a stage and an output in a different stage with the same name?
In that case, both stages will be reproduced.
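A toy sketch of that resolution rule, assuming a target is matched against both stage names and output paths. The `Stage` dataclass and `resolve_target` are hypothetical and much simpler than DVC's real stage objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    # Hypothetical, simplified stage: a name plus its output paths.
    name: str
    outs: tuple = ()


def resolve_target(stages, target):
    # A stage is selected if the target equals its name or one of its outputs,
    # so a name clash between a stage and an output selects both stages.
    return [s for s in stages if s.name == target or target in s.outs]


stages = [
    Stage(name="model", outs=("metrics.json",)),
    Stage(name="train", outs=("model",)),
]
print([s.name for s in resolve_target(stages, "model")])
```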
It is possible that we might break some users' scripts. I see 2 solutions to this problem:
While I am not a fan of breaking consistency between minor dvc versions, it seems to me that in this case it would be justified by the fact (noted by @jorgeorpinel) that So if we go with
Before 1.0, users always specified outputs as targets to
Hmm, so do we recommend users avoid this? How would someone repro one or the other? Is this the same behavior for
Makes sense, but those commands are inherently about files and
We don't do that in any way; the glob option is there to switch this on or off. If one wants to reproduce only a particular output,
No, it does not.
Thanks @pared, makes sense! I'm still unsure about introducing a breaking change in a minor release like this. Would be interested in getting thoughts from others @jorgeorpinel @efiop. Is changing from
Btw, @skshetry considering the implications that we already have in
I'd also err on the side of not breaking anyone's scripts for now. Also keep in mind
Would be ideal to error-out instead with an appropriate message, and a way to specify whether you meant the stage or the output. But prob not critical so could be left for a follow-up PR. And again will this all pass on from
Agree. Another route to consider: leave
It will be kinda confusing mixing outputs and stages both for glob. I have been against the change for supporting output names as a target in Internally, it complicates things as well, as it's already quite hard to support multiple targets, optimize for stage names and directories, and provide good error messages as well. So I'd prefer avoiding it as much as possible.
Why would we need dvc-files here? @dberenbaum, @jorgeorpinel, @efiop, as mentioned by @pared, I don't think the
Thanks for a great discussion, guys! 🙏 We had a great discussion with @skshetry about this today and so, to summarize: we can't merge this as is, as it breaks backward compatibility and creates a problem for us in the near future when we enable --glob by default. The latter problem applies to alternative @Vonski Thank you so much for your contributions! 🙏 I hope you are not feeling discouraged by this; it is just a classic issue that turned out to be much more complex than originally anticipated, and all of these discussions and research are arguably an even bigger contribution to the project than the PR itself, so we truly appreciate it. So there are three paths we could take from here:
@Vonski What do you think?
Closing for now.
Ah, sorry I'm late. I would be happy to start from scratch and try to push the issue forward anyway. I think option 1 is the best idea under the circumstances. I am also a little more engaged in some other things right now, so I have less time to make things happen here, but if that is not a problem, I want to help with it. I hope I'll find some time this week to create another PR for this.
@Vonski Sounds good! Thank you! 🙏
I'm not sure whether I should have added a special test for this functionality.
Should the `--glob` functionality be preserved? If so, should handling for it be added to `collect_granular(...)`?
Fixes #3875
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
@pared