Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pull: pull all parameterized stages (--group) #7462

Closed
sjawhar opened this issue Mar 14, 2022 · 17 comments
Closed

pull: pull all parameterized stages (--group) #7462

sjawhar opened this issue Mar 14, 2022 · 17 comments
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something?

Comments

@sjawhar
Copy link
Contributor

sjawhar commented Mar 14, 2022

We use parametrized stages in our repos a lot (foreach/do). It's a hassle to have to dvc pull stage@parameter for each parameter. It would be great if we could add a --group flag to fetch, pull, and checkout that would run for all parametrized versions of that stage. I thought --glob would do this, but after digging into the code I can see that's not what we want.

I called it --glob by taking the name of the kwarg from StageLoad.load_all(), which is the method called under the hood by these various commands: dvc pull > Repo.pull > Repo.fetch > Repo.used_objs > Index.used_objs > StageLoad.collect_granular > stage._collect_specific_target > stage._maybe_collect_from_dvc_yaml > StageLoad.load_all(). 😅

If there are no objections or if someone isn't already working on it, I'm happy to contribute. I'd just pass the optional accept_group: bool = False down through that call chain.

@karajan1001
Copy link
Contributor

karajan1001 commented Mar 15, 2022

Thanks, good suggestion. And for now. you can use the following method.
For the target of dvc pull it can accept output file as argument. And if there were already these files in your workspace ( you want to update instead of newly downloaded) you can use a wild match pattern to match them.

@karajan1001 karajan1001 added the feature request Requesting a new feature label Mar 15, 2022
@sjawhar
Copy link
Contributor Author

sjawhar commented Mar 15, 2022

Yes, but the problem there is just as you stated: IF you already have them in the workspace.

@dberenbaum
Copy link
Collaborator

I thought --glob would do this, but after digging into the code I can see that's not what we want.

Seems like I should know this 😅 , but would someone mind explaining why --glob cannot do this?

@sjawhar
Copy link
Contributor Author

sjawhar commented Mar 16, 2022

Seems like I should know this sweat_smile , but would someone mind explaining why --glob cannot do this?

--glob was implemented in #5032 to address #4816. It's about paths, which can be a bit confusing because StageLoad.load_all() also takes an optional glob argument, which does indeed glob match the stage names. 😸

@karajan1001
Copy link
Contributor

I thought --glob would do this, but after digging into the code I can see that's not what we want.

Seems like I should know this 😅 , but would someone mind explaining why --glob cannot do this?

For glob, it only matches the targets.

--glob               Allows targets containing shell-style wildcards.

And besides this a big limitation is mentioned before: IF you already have them in the workspace.

For example for a stage train output model.pkl
We have the following result.

$ dvc repro --glob mode?.* # can match `model.pkl` 
Data and pipelines are up to date.
$ dvc repro --glob mode??.* # two ?? here can not match `model.pkl`
zsh: no matches found: mode??.*
$ dvc repro --glob trai* # cannot match stage name `train`
zsh: no matches found: trai*
$ rm model.pkl # if we delete the file `model.pkl`
$ dvc repro --glob mode?.* # cannot match it for now.
zsh: no matches found: mode?.*

@dberenbaum
Copy link
Collaborator

Hm, I was getting confused because of behavior without wildcards:

$ dvc repro --glob mode
Data and pipelines are up to date.
$ dvc repro --glob trai
Data and pipelines are up to date.
$ rm model.pkl
$ dvc repro --glob mode
Data and pipelines are up to date.
$ dvc repro --glob mode*
zsh: no matches found: mode*

It seems without wildcards that --glob is matching everything. Is this expected?

@skshetry
Copy link
Member

skshetry commented Mar 16, 2022

@dberenbaum, the last one is coming from your shell, you have to single-quote it to avoid wildcard expansion by the shell.

@dberenbaum
Copy link
Collaborator

Isn't the same true for the example from @karajan1001?

@karajan1001
Copy link
Contributor

@dberenbaum, the last one is coming from your shell, you have to single-quote it to avoid wildcard expansion by the shell.

I think so, we need to add a " to prevent it from being replaced by the shell.

$ dvc repro --glob "trai*"
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' is cached - skipping run, checking out outputs
Use `dvc push` to send your updates to remote storage.
$ dvc repro "trai*"
ERROR: Stage 'trai*' not found inside 'dvc.yaml' file

So, @sjawhar . you can try this method dvc pull stage@1_*, I tried it on my computer and get the correct file downloaded.

$ rm raw*.cln
dvc pull --glob "raw*.cln"
A       raw1.cln
A       raw2.cln
2 files added

@dberenbaum
Copy link
Collaborator

@karajan1001 I see that repro --glob works as expected, but pull --glob pulls more than expected unless I'm misunderstanding:

$ dvc pull --glob 'feat*'
A       model.pkl
A       data/prepared/
A       data/data.xml
A       data/features/

@sjawhar
Copy link
Contributor Author

sjawhar commented Mar 17, 2022

@karajan1001 that only appears to work for you because it's in fact pulling all stages. You can see this by setting a breakpoint here:

dvc/dvc/repo/pull.py

Lines 13 to 34 in 0797147

@locked
def pull(
self,
targets=None,
jobs=None,
remote=None,
all_branches=False,
with_deps=False,
all_tags=False,
force=False,
recursive=False,
all_commits=False,
run_cache=False,
glob=False,
odb: Optional["ObjectDB"] = None,
):
if isinstance(targets, str):
targets = [targets]
expanded_targets = glob_targets(targets, glob=glob)
processed_files_count = self.fetch(

No targets are matches, so expanded_targets is an empty list and everything gets pulled as a result:
image

@sjawhar
Copy link
Contributor Author

sjawhar commented Mar 17, 2022

@dberenbaum repro uses StageLoad.collect(), which uses StageLoad.load_all(). So it is indeed doing the type of globbing you (and I) would expect. pull instead calls that other glob_targets function, which globs on path. I'm not sure why a --glob option for paths is needed when the shell can do it for you?

@karajan1001 karajan1001 added bug Did we break something? A: data-sync Related to dvc get/fetch/import/pull/push and removed feature request Requesting a new feature labels Mar 18, 2022
@sjawhar
Copy link
Contributor Author

sjawhar commented Mar 21, 2022

So, can I go ahead and implement? I see the issue was tagged as a bug, so should I instead change how the --glob argument to pull works to be consistent with repro?

@pared
Copy link
Contributor

pared commented Mar 22, 2022

Related #7323

@dberenbaum
Copy link
Collaborator

@sjawhar We discussed this one, and you managed to find multiple issues with the globbing features, which need a lot of work 😅 .

Addressing a group should be handled automatically as part of #7323, so simply dvc pull stage should work. Please keep an eye there for changes to address your initial request 🙏 .

Globbing outputs generally should not depend on what is checked out in the workspace. Otherwise, I agree it's not that useful for commands other than dvc add.

There's also confusion because --glob works differently in repro from pull and other commands, as you noticed. There was discussion in #5723 of separating into --glob-stages and/or --glob-outs, but ultimately it may be better to have globbing on by default for both.

@sjawhar
Copy link
Contributor Author

sjawhar commented Mar 23, 2022

OK, so how about I start with a PR that addresses that checklist you posted in #7323? That would actually take care of my needs. Maybe this issue can be closed in favor of #7323 and #7491.

@dberenbaum
Copy link
Collaborator

@sjawhar I'm not sure yet what the plan is to address #7323, and it might require some higher-level changes. @skshetry Can you comment on whether there is anything @sjawhar can do to help?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something?
Projects
None yet
Development

No branches or pull requests

5 participants