add support for wildcard/patterns #4816

Open
5 tasks
dmpetrov opened this issue Oct 31, 2020 · 17 comments
Labels
feature is a feature p2-medium Medium priority, should be done, but less important

Comments

@dmpetrov
Member

dmpetrov commented Oct 31, 2020

Sometimes a user needs only a subset of files when importing or pulling data from a data directory. It would be convenient to define a file pattern for such an import or pull.

From https://discuss.dvc.org/t/working-with-a-small-subset-of-remote-data/541
Related: #4705, #4815

Patterns to implement:

  • simple wildcard dvc pull cats-dogs/data/train/dogs/*.img
  • whole wildcard dvc pull cats-dogs/data/train/{dogs,cats}/???.img
  • globstar/recursive dvc pull cats-dogs/data/train/**/*.img
  • iterator dvc pull cats-dogs/data/train/dogs/%C.img?counter=1:100
  • date dvc pull users/%Y/%m/%d/users.csv?startdate=2020-09-01,enddate=now,ignoremissing

The first three patterns should use regular Unix file (glob) syntax, while the last two require a special language to define the pattern; we need to find good examples.
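To make the last two proposals more concrete, here is a minimal sketch of how an iterator template and a date template might be expanded into plain paths. It is an illustration only: the function names are hypothetical, parsing of the `?...` query part is left out, and inclusive bounds are an assumption, since the issue does not define the semantics.

```python
from datetime import date, timedelta
from typing import List


def expand_counter(template: str, start: int, stop: int) -> List[str]:
    """Replace the %C placeholder with each integer in [start, stop] (assumed inclusive)."""
    return [template.replace("%C", str(i)) for i in range(start, stop + 1)]


def expand_dates(template: str, start: date, end: date) -> List[str]:
    """Render a strftime-style template once per day in [start, end] (assumed inclusive)."""
    days = (end - start).days
    return [(start + timedelta(days=n)).strftime(template) for n in range(days + 1)]


counter_paths = expand_counter("cats-dogs/data/train/dogs/%C.img", 1, 3)
date_paths = expand_dates("users/%Y/%m/%d/users.csv", date(2020, 9, 1), date(2020, 9, 3))
print(counter_paths)
print(date_paths)
```

Expanding to a concrete list of paths up front would let the existing per-file logic stay unchanged, but that is a design assumption rather than anything decided in this issue.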

@dmpetrov dmpetrov added the feature is a feature label Oct 31, 2020
@dmpetrov
Member Author

Based on my experience I'd assign the priorities like this:

  1. simple wildcard *
  2. globstar/recursive **
  3. date path/file-%Y-%m-%d.txt
  4. iterator/count %C
  5. whole wildcard - ?, ., {}

But we need to agree on the common pattern format (how to reflect the pattern in dvc-files) before implementing even the first step.

@efiop efiop added the p3-nice-to-have It should be done this or next sprint label Nov 1, 2020
@efiop efiop changed the title from "Filename patters in import and pull" to "add support for wildcard/patterns" Nov 1, 2020
@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p3-nice-to-have It should be done this or next sprint labels Nov 1, 2020
@efiop
Contributor

efiop commented Nov 2, 2020

Regarding the first step

simple wildcard dvc pull cats-dogs/data/train/dogs/*.img

Support for dir entries will simply require treating the existing `filter_info` in `collect_used_dir_cache` appropriately. Right now we only check whether the filter equals or contains other files.

Regular glob patterns are more clear-cut than the proposed date/counter selectors; those need some research on existing solutions. So this is a multilayer ticket with a lot of special cases.
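As a rough illustration of the first step, matching directory entries against a simple wildcard could look like the sketch below. The helpers `match_simple` and `filter_dir_entries` are hypothetical and are not DVC's actual code. Matching segment by segment is an assumed design, chosen so that `*` does not cross directory separators (plain `fnmatch` would let it).

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def match_simple(path: str, pattern: str) -> bool:
    """Match a '/'-separated path against a pattern, one path segment at a time."""
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(part, pat) for part, pat in zip(path_parts, pattern_parts))


def filter_dir_entries(entries: Iterable[str], pattern: str) -> List[str]:
    """Keep only the directory entries (relative paths) that match the pattern."""
    return [entry for entry in entries if match_simple(entry, pattern)]


entries = ["dogs/1.img", "dogs/2.txt", "dogs/sub/3.img", "cats/4.img"]
print(filter_dir_entries(entries, "dogs/*.img"))  # ['dogs/1.img']
```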

@karajan1001
Contributor

Related #4419.

@ju0gri
Contributor

ju0gri commented Nov 6, 2020

I will be taking a stab at implementing the first step for this issue.

ju0gri added a commit to ju0gri/dvc that referenced this issue Nov 11, 2020
Related to iterative#4816.

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>
ju0gri added a commit to ju0gri/dvc that referenced this issue Nov 11, 2020
Adds a new argument for the add command `glob` that is disabled by default and when enabled it passes
the input targets through glob filtering.

Related: iterative#4816

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>
efiop added a commit that referenced this issue Nov 11, 2020
* api: add support for simple wildcards

Related to #4816.

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>

* api: make wildcard interpretation optional

Adds a new argument for the add command `glob` that is disabled by default and when enabled it passes
the input targets through glob filtering.

Related: #4816

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>

* Update dvc/repo/add.py

* Update dvc/repo/add.py

* Update dvc/repo/add.py

* Update dvc/repo/add.py

* Update dvc/repo/add.py

Co-authored-by: Ruslan Kuprieiev <kupruser@gmail.com>
@efiop
Contributor

efiop commented Nov 19, 2020

Related #4912

@dmpetrov
Member Author

Sounds like at least this check box could be marked, per #4864?

@jorgeorpinel #4864 covers only dvc add. Support for pull/push/import is still missing, so the first checkbox cannot be checked yet.

@ju0gri
Contributor

ju0gri commented Nov 24, 2020

I can continue adding this functionality for all commands, if that's alright.

@efiop
Contributor

efiop commented Nov 24, 2020

@ju0gri Thanks for looking into it! 🙏

@jorgeorpinel
Contributor

jorgeorpinel commented Dec 3, 2020

Question:

We've introduced the --glob option to a few commands to implement some of the patterns above (the ones covered by glob, i.e. 1, 2, and 5 from #4816 (comment)).

Is the option temporary, with the expectation that this becomes the default behavior at some point? Otherwise I think we may need a better term, as discussed in #4976 (comment), even more so now that I see patterns 3 (date) and 4 (iterator), which I think aren't covered by "glob".

Thanks
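
For reference, the opt-in behavior described in the commits above (targets pass through glob filtering only when the option is enabled) could be sketched roughly as follows. `expand_targets` is a hypothetical name, not the utility function added to DVC. Note that Python's standard `glob` module handles `*`, `?`, `[]` and, with `recursive=True`, `**`, but it does not perform `{a,b}` brace expansion.

```python
import glob
from typing import Iterable, List


def expand_targets(targets: Iterable[str], use_glob: bool = False) -> List[str]:
    """Return targets unchanged unless use_glob is set; then expand each as a glob pattern."""
    if not use_glob:
        return list(targets)
    expanded: List[str] = []
    for target in targets:
        # recursive=True makes '**' match any number of nested directories.
        expanded.extend(sorted(glob.glob(target, recursive=True)))
    return expanded
```

Keeping the expansion off by default means a literal file name that happens to contain `*` or `?` is still treated as a plain path.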

@ju0gri ju0gri mentioned this issue Dec 4, 2020
2 tasks
ju0gri added a commit to ju0gri/dvc that referenced this issue Dec 10, 2020
related: iterative#4816

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>
ju0gri added a commit to ju0gri/dvc that referenced this issue Dec 11, 2020
Related: iterative#4816

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>
ju0gri added a commit to ju0gri/dvc that referenced this issue Dec 17, 2020
Related: iterative#4816

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>
efiop added a commit that referenced this issue Dec 19, 2020
* api: add glob option for pull command

Related: #4816

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>

* api: add globbing utility function

Related: #4816

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>

* api: use utility function for pull command

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>

* Update dvc/utils/__init__.py

Co-authored-by: Ruslan Kuprieiev <kupruser@gmail.com>
efiop pushed a commit to ju0gri/dvc that referenced this issue Dec 25, 2020
Related: iterative#4816

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>
efiop pushed a commit that referenced this issue Dec 25, 2020
* api: add globbing option for pushing

Related: #4816

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>

* api: use utility function for push command

Signed-off-by: Ioana Grigoropol <ioana.grigoropol@gmail.com>
@jorgeorpinel
Contributor

jorgeorpinel commented Jan 13, 2021

Hi, can we include the discussion about wildcards in stage output and dependency definitions (in dvc.yaml and maybe also run/stage add -od)? It's not listed in the check boxes of this issue's description, but it's mentioned in #1462 (comment) (2.A and B). Or I can make a separate issue for outs/deps.

A couple users have brought up the need for this in https://discuss.dvc.org/t/managing-pipelines-operating-per-dataset-element/613

@shcheklein shcheklein added A: pipelines Related to the pipelines feature and removed A: pipelines Related to the pipelines feature labels Apr 15, 2023
@tibor-mach
Contributor

Seconding @jorgeorpinel on this; there is some new demand for wildcards in dvc stage outputs.
