Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiline command? #4373

Closed
skshetry opened this issue Aug 11, 2020 · 19 comments · Fixed by #4934 or #5194
Closed

Support multiline command? #4373

skshetry opened this issue Aug 11, 2020 · 19 comments · Fixed by #4934 or #5194
Assignees
Labels
feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint

Comments

@skshetry
Copy link
Member

Something like done in Github actions with run and script?

stages:
  stage1:
    cmd: |
      cmd1
      cmd2

Or, perhaps:

stages:
  stage1:
    cmd:
      - cmd1
      - cmd2

Or, maybe both?

Need to double check if we will be able to keep the dump in this shape or form.

Why?

Better expression

Current Workaround:

stages:
  stage1:
    cmd: >-
      cmd1
      && cmd2
      && cmd3
@efiop efiop added feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint labels Aug 11, 2020
@efiop
Copy link
Contributor

efiop commented Nov 16, 2020

For the record:

stages:
  stage1:
    cmd: |
      cmd1
      cmd2

this is already supported and will be passed as is to the shell (we use shell=True in Popen). I would keep it as is.

stages:
  stage1:
    cmd:
      - cmd1
      - cmd2

This looks great and I think it should behave like cmd1 && cmd2 && cmd3, meaning that if one command fails then the rest is not executed and results in a stage failure.

@ClementWalter
Copy link
Contributor

ClementWalter commented Nov 21, 2020

I have started to look at the code. @efiop is it ok to add a self.cmd = " && ".join(cmd) if isintance(cmd, list) else cmd in Stage.__init__ or should the Stage be able to truly handle a list of commands?

@efiop
Copy link
Contributor

efiop commented Nov 21, 2020

Hi @ClementWalter !

Thanks for looking into it! What advantages/disadvantages do you see for doing a join like that?

@efiop
Copy link
Contributor

efiop commented Dec 3, 2020

Keeping open for the question about multiline cmd handling raised by @skshetry

@efiop efiop reopened this Dec 3, 2020
@jorgeorpinel
Copy link
Contributor

Hi. I'm curious what's the use case for this. Is there a good example of multiple commands that share deps and outs? (to improve the docs)

@skshetry
Copy link
Member Author

skshetry commented Dec 29, 2020

Hi. I'm curious what's the use case for this.

This is more of a convenience/ergonomical feature. So, anything you can write in a single line can be written in multiple lines or vice versa but might be easier than just cramming it in one with &&.

Is there a good example of multiple commands that share deps and outs? (to improve the docs)

You can use Github search for this (about usage of &&). But, why does it have to share deps/outs at all?

Here's a list of what I could find:

  • building a model and testing it with a different script
  • building each model in a separate virtual environment and evaluating them
  • building each model in a separate directory
  • cd ../directory && python script.py - looks like they are keeping dvc repo and code separately.
  • fetching a pre-trained model and training them on local datasets (wget, etc).
  • cleaning up randomly generated files after building/training/evaluating.
  • setting some env vars, PythonPath, etc.
  • cd directory && python test.py - maybe cd is more convenient than wdir as cd only affects during runtime.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Dec 29, 2020

Thanks! I'm aware of POSIX &&. My question is about a real user need that this issue initially tried to cover. I'm getting the idea that this is basically for nice formatting (alt. to && basically).

Also thanks for the specific example ideas! I think that most of them may be better served by other forms thouogh:

  • building a model and testing it with a different script
  • fetching a pre-trained model and training them on local datasets (wget, etc).
  • cleaning up randomly generated files after building/training/evaluating.

Multiple stages

  • building each model in a separate virtual environment and evaluating them
  • building each model in a separate directory

Stage groups (foreach syntax in DVC 2.0)

  • cd ../directory && python script.py
  • cd directory && python test.py

wdir (or just keeping &&)

setting several env vars seems like the only instance where this is really convenient, to avoid an ugly string full of &&.

why does it have to share deps/outs at all?

Initially I was assuming that would be the point of having several commands in a stage, since having each command have separate deps/outs and listing those in a single stage doesn't sound very clean (again, better to use 2 stages instead?) Although I suppose if the stages are parallel, it could be a nice use of this feature, along with a desc or # comment about that perhaps.

Anyway, my Q is answered. Will check with Ivan whether it's worth improving the docs for this. Thanks again!

@skshetry
Copy link
Member Author

Multiple stages

You are thinking of an ideal situation. I won't be creating a separate stage just to clean the logfile the script creates. It's a convenience feature.

  • building each model in a separate virtual environment and evaluating them
  • building each model in a separate directory

Stage groups (foreach syntax in DVC 2.0)

How do you create a virtual environment and run the script with foreach? It's a different thing.

  • cd ../directory && python script.py
  • cd directory && python test.py

wdir (or just keeping &&)

Like I said, cd definitely has its place. wdir applies to all outs/deps whereas cd is a runtime component.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Dec 29, 2020

It all seems very hypothetical to me but yes, you may be right. It's nice to have this option anyway.

@skshetry
Copy link
Member Author

It all seems very hypothetical to me but yes, you may be right.

All of the examples are from Github's search. How could that be hypothetical?

@jorgeorpinel
Copy link
Contributor

Oh I meant not from DVC users.

@shcheklein
Copy link
Member

@jorgeorpinel

Oh I meant not from DVC users.

I guess, they are from DVC users since they come from Github repos that use DVC.

@jorgeorpinel
Copy link
Contributor

Ah, I didn't realize those GitHub repos use DVC! I thought it was from generic repos. Will try to find some some of them...

@jorgeorpinel
Copy link
Contributor

Too bad that

You can't use the following wildcard characters as part of your search query: . , : ; / \ ` ' " = * ! ? # $ & + ^ | ~ < > ( ) { } [ ]. The search will simply ignore these symbols.

i.e. I can't actually search for && - according to GH search docs.

So far I've been able to find one interesting example (python -m ... && fdupes ...) in my search but @skshetry if you have a better search can you share it please? I'll keep looking for now.

@johnnychen94
Copy link
Contributor

johnnychen94 commented Dec 31, 2020

building each model in a separate virtual environment and evaluating them

Here's the Julia example:

  • each stage has a separate virtual environment: --project=prepare. Package dependencies are provided by prepare/Project.toml and tracked by git.
  • Pkg.instantiate() sets up the project environment (like pip install -r requirements.txt)
prepare:
  foreach: ${models}
  do:
    cmd: >-
      julia --project=prepare -e "using Pkg; Pkg.instantiate()" &&
      julia --project=prepare \
        prepare/main.jl data/train/rawdata/ data/train/prepared/${item.name}.h5 \
        --patch-stride ${item.data.patch_stride} \
        --patch-size ${item.data.patch_size} \
        --augment-times ${item.data.augment_times}
    deps:
    - data/train/rawdata
    - prepare
    outs:
    - data/train/prepared/${item.name}.h5

I'm thinking of dvc as a build system like make, so setting up a virtual environment has real merits for reproducibility. Version compatibility is the root of many hidden problems.

@skshetry
Copy link
Member Author

@johnnychen94, we do support list now if you are using from the master.

So, you could write that as:

cmd:
  - julia --project=prepare -e "using Pkg; Pkg.instantiate()"
  - julia --project=prepare \
        prepare/main.jl data/train/rawdata/ data/train/prepared/${item.name}.h5 \
        --patch-stride ${item.data.patch_stride} \
        --patch-size ${item.data.patch_size} \
        --augment-times ${item.data.augment_times}

I see that you use \ at the end which should not be required as you are using folded style string (with >-), which will make it even cleaner.

cmd:
  - julia --project=prepare -e "using Pkg; Pkg.instantiate()"
  - >-
    julia --project=prepare
        prepare/main.jl data/train/rawdata/ data/train/prepared/${item.name}.h5
        --patch-stride ${item.data.patch_stride}
        --patch-size ${item.data.patch_size}
        --augment-times ${item.data.augment_times}

@skshetry
Copy link
Member Author

skshetry commented Dec 31, 2020

i.e. I can't actually search for && - according to GH search docs.

@jorgeorpinel, I use a simple script to fetch all the files matching a query using Github's token and search locally. There aren't that many, so should be quick. Try the script in wsl.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 3, 2021

Thanks for sharing your real like example @johnnychen94 ! I'm not familiar with Julia so forgive me if this is obvious but why do you run using Pkg; Pkg.instantiate() separately instead of having it in prepare/main.jl?

I'll try your script and lyk @skshetry! Thanks

@johnnychen94
Copy link
Contributor

I'm not familiar with Julia so forgive me if this is obvious but why do you run using Pkg; Pkg.instantiate() separately instead of having it in prepare/main.jl?

There are no standard rules about it. It's doable but I deliberately kept them apart because I think the environment initialization part has nothing to do with the real code. We normally run the real code multiple times (for developing or debugging purposes) while only initialize the environment once.

This is also something that the whole Julia community tries to commit when dealing with CI. For example: https://github.com/JuliaImages/juliaimages.github.io/blob/cbd83915f5dfb0259fedaa3d1d14c67f34d983eb/.travis.yml#L14-L16

@efiop efiop self-assigned this Jan 4, 2021
efiop added a commit to efiop/dvc that referenced this issue Jan 4, 2021
Implements an approach proposed by @skshetry.

Fixes iterative#4373
efiop added a commit that referenced this issue Jan 4, 2021
Implements an approach proposed by @skshetry.

Fixes #4373
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint
Projects
None yet
6 participants