Cloud execution #3

Open · wants to merge 4 commits into main
182 changes: 182 additions & 0 deletions cloud-exec.md
@@ -0,0 +1,182 @@
# Cloud execution

- Enhancement Proposal PR: (leave this empty)
- Contributors: dmpetrov

# Summary

Users need to run ML experiments on cloud instances or remote machines when
they do not have enough resources on the local machine or when a bigger
scale is required for ML training (e.g. 5 GPU instances running different
experiments).

Cloud execution should be supported at the DVC experiments level, so that
all the training results are presented in a single experiments table.

# Motivation

Today users need to spend time instrumenting remote execution when there is
a lack of local resources for ML training (like GPU or RAM) or when it is a
company policy to train on dedicated hardware or on a Kubernetes cluster.
The existing way of executing training through [CML](cml.dev) (CI/CD) works
great for handing off ML models between team members but is not the best
approach in the experimentation phase, since CI/CD adds overhead to the
execution time and leaves footprint/records in CI/CD that become
overwhelming when the number of experiments is large (tens, hundreds, or
thousands).
This feature can save time for users and improve the ML experimentation
experience.

## Scenarios

1. **Run exp.** Allocate a cloud instance (with a specified configuration) and
execute an ML experiment `dvc exp run` (or `dvc repro`) on it.
2. **Run stage.** Allocate a cloud instance and execute a particular stage of
an experiment on it.
3. **Hot instance.** Execute on an existing running instance without instance
allocation.
4. **Remote machine.** Execute on a remote machine (cloud or on-premise).
5. **Instance pool.** Execute on one of the existing running instances.
6. **Exp queue.** Execute a set of experiments (from the exp queue, for example)
on an instance pool or a single instance.
Comment on lines +37 to +39
Contributor:

To what extent would DVC utilize pools? Are we thinking of parallel execution/ distributed computing? Or for now focusing on single instances (which may be obtained from a pool)?

Member Author:

Single instance or a pool of instances for parallel execution. But it is not about distributed learning. (if I understood the question correctly).

7. **Web run.** Execute an experiment from SaaS/Web using one of the methods
above and DVC machinery.
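
To make the scenarios more concrete, here is a hypothetical UX sketch. The
`--executor` option is the one proposed later in this document; none of these
flags exist today, and the per-stage and SSH variants are illustrative only:

```bash
# 1. Run a whole experiment on a freshly allocated cloud instance
$ dvc exp run --executor my-gpu-tesla
# 2. Run a single stage of the pipeline remotely
$ dvc repro train --executor my-gpu-tesla
# 4. Use an existing remote (on-premise) machine
$ dvc exp run --executor ssh://user@on-prem-box
```
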
Comment on lines +30 to +41
Contributor:

Are there different categories of scenarios? Some of these overlap. For example, a web run could apply to any of the other scenarios. To support all of these, it seems like these specs would be needed:

Compute resource spec:

- Provisioning: new vs. existing
- Location: cloud vs. on-premises
- Count: single instance vs. pool

DVC execution spec:

- Stages: single stage vs. full pipeline
- Experiments: single vs. multiple
- Entrypoint: local vs. SaaS/web

Does that seem accurate? And is that a useful way to organize it?

Contributor:

@dberenbaum I think the scenarios can def. be grouped. This is probably a non-exhaustive list of use cases.

I like the categories you summarized but wouldn't see them as specs (if you meant specs to implement the feature). The user will prob care about these distinctions, but they may not be relevant for the tool except perhaps whether running on a single or several instances (IF this includes parallel execution).

This is great for how to teach this, though!

Member Author:

@dberenbaum some regrouping might be needed. But in some cases, such grouping might not be relevant, as @jorgeorpinel mentioned. An example: an on-premise run is a prerequisite of cloud. I'd keep the whole list for now and try to cut the scope when we start implementing this.

Contributor:

No problem, this just helps me organize it better in my head.


## Optimizations

Multiple cloud instance optimization techniques might be used to optimize
resource usage or execution time:
1. **Spot instances.**
2. **Transparent spot instance.** Recover the execution if a spot instance
   was terminated. DVC checkpoints and pipeline stages should be used for
   preserving the state (see the sketch after this list).
3. **Volumes.** Volumes can be attached and reused in instances to minimize
data cache synchronization time.
4. **Shared volumes.** Separate cloud services (such as Multi-Attach EBS or
EFS) might be needed for sharing data cache between multiple instances.
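
For the transparent spot instance case, DVC's existing checkpoint mechanism
could serve as the basis for recovery. A minimal sketch of a checkpoint-enabled
stage (standard `dvc.yaml` checkpoint syntax at the time; `model.pt` is just an
illustrative output, and automatic recovery on a re-provisioned instance is the
proposed part):

```bash
$ cat dvc.yaml
stages:
  train:
    cmd: python train.py
    outs:
      - model.pt:
          checkpoint: true   # lets a re-provisioned instance resume from the last checkpoint
```
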
Comment on lines +47 to +54
Contributor:

Should we separate compute resource types (1 and 2) and storage resource types (3 and 4)?

Member Author:

Sure. But for now, the list of optimizations seems too small to warrant separation or regrouping. Also, some optimizations might be related - like re-attaching a volume when a spot instance fails.


## Monitoring

Additional features might be needed to improve the user experience:
1. **State monitoring.** A user should see whether an instance has already
   been allocated and which stage is running.
2. **Metrics monitoring.** Monitoring of the latest metrics (with some
   reasonable delay: 10, 30, or 60 seconds).


## Resource orchestrators

1. AWS
2. Azure
3. GCP (optional)
4. Remote over SSH
5. Kubernetes (K8S)
6. HashiCorp Nomad (optional)
7. Container services (ECS and Azure/GCP analogs)
Comment on lines +67 to +73
Contributor:

Seems like there are a few different categories here also:

  • Instance provisioning (1-3)
  • Managing existing resources (4)
  • Cluster provisioning (5-7)

Member Author:

I list them in priority order from the user's point of view.

Do you think the categories will help? There might be many ways of categorizing. From an implementation point of view:

  • (4) is a prerequisite for all others
  • among the clouds (1-3), GCP (3) is "special"
  • (5)/K8S should be similar to clouds (1-3) with some "specialty"

Contributor:

These categories are all helpful for me, especially since these lists have items of different types, which confuses me if I don't break them down. Again, not important as long as all the scenarios are clear to everyone.



# Detailed design

## Iterative Terraform provider

Cloud management seems like the heaviest part of the project (from an
implementation point of view). However, the problem of executing and
managing cloud instances was already solved in the CML project.

CML separates out the user API part (commands) from the cloud instance
management code.
The cloud management is extracted into a separate **Iterative Terraform (TF)
provider**:
https://github.com/iterative/terraform-provider-iterative

The Iterative provider targets data scientists, not DevOps engineers (TF's
usual audience). It has a somewhat simplified API and some ML-specific
functionality (some of it still in progress), such as recovering a spot
instance if it was terminated or attaching a reusable drive.

The Iterative TF provider can be used from DVC for cloud instance management.
Some Python libraries might be required to access the Go-based Terraform
provider.

Today the provider supports:
- AWS, Azure, and K8S orchestrators
- spot instance and volume optimizations

A few of the initial scenarios can already be implemented with this
functionality. More is coming.
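
As a reference point, here is a minimal sketch of how the provider is used on
its own (loosely based on the provider's README at the time; resource and field
names may have changed since). DVC could drive this under the hood, for example
by generating the TF file and shelling out to `terraform`:

```bash
$ cat main.tf
terraform {
  required_providers {
    iterative = { source = "iterative/iterative" }
  }
}

resource "iterative_machine" "machine" {
  region        = "us-west"
  instance_type = "t2.micro"
}

$ terraform init && terraform apply   # provision the instance
$ terraform destroy                   # terminate it when the run is done
```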

## Instances configuration

Instance configuration is a definition of an instance and its behavior:
- orchestrator: AWS, GCP, Azure, remote, K8S, ...
- instance type
- volume
- many other options

Instance configuration is well developed in the CML project and can be reused.
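
By analogy with how data remotes are stored, such a configuration could end up
as a named section in `.dvc/config`. The section below is purely illustrative
(no such `executor` section exists today); the values are the ones used as
examples elsewhere in this document:

```bash
$ cat .dvc/config
['executor "my-gpu-tesla"']
    cloud = aws
    region = us-west
    instance_type = p3.large
    volume = abcdef
    spot = true
```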

## Executors and naming

Ideally, a user should be able to define or override executors in config
(`dvc.yaml` or `.dvc/config`). `dvc exp run` should then execute the
experiment, provisioning and terminating instances as needed.

The executor definition (a `p3.large` AWS instance in `us-west` with volume
`abcdef` attached) should be decoupled from the pipeline definition the same
way remotes are: stage `train` runs on `my-gpu-tesla`, not on an inline
executor definition with `p3.large`.
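
A hypothetical illustration of that decoupling (the `executor` field does not
exist in the `dvc.yaml` schema today and is shown only to make the idea
concrete):

```bash
$ cat dvc.yaml
stages:
  train:
    cmd: python train.py
    executor: my-gpu-tesla   # name only; the p3.large/us-west/volume details live in config
```
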
Comment on lines +121 to +124
Contributor:

If we define in dvc.yaml, it makes it hard to run elsewhere (like locally), and it couples experiment history and infrastructure (similar to problems with changing remotes today). Maybe having a local config option or having an exp run --executor flag is sufficient flexibility here? What would we do differently with remotes if we were starting from scratch?


Many instance types and even multiple clouds could be used in a single DVC
project. We need a way of configuring **executors** and reusing them the same
way a user can configure a data remote.
It could look like this:
```bash
$ dvc exp run --executor my-gpu-tesla
...
Comment on lines +129 to +132
Contributor @jorgeorpinel (Mar 31, 2021):

Would executors get set up like remotes?

$ dvc executor add my-gpu-tesla ssh://myuser@example.com
$ dvc executor modify my-gpu-tesla pword 'myp4$$'
$ dvc exp run --on my-gpu-tesla

Contributor:

Are resulting outputs synchronized back into the workspace? Or does the executor keep its own repo clone with separate state?


> Are resulting outputs synchronized back into the workspace? Or does the executor keep its own repo clone with separate state?

My assumption here would be that the results of an experiment are synced back into the workspace ("results" meaning git-tracked repo state). Current --temp exp runs are already (local) executors, and experiment execution is already structured so that it would work with any machine we can talk to via git+ssh. So the git-tracked results of an executor run will be retrieved as experiment refs (like with existing local --temp runs).

For DVC-tracked (cached) outputs, they would be fetched into the local cache. This could be done by either fetching the data directly from the executor machine, or by using an intermediate DVC remote (so after a run, the executor does dvc push to a DVC remote, and then the user's local machine does dvc pull). The final dvc pull could actually be optional here, since the user may not need the cache data on their local machine at all.

Using the intermediate remote seems like it would fit better to me, but I think that needs some more discussion.

Contributor @jorgeorpinel (Apr 1, 2021):

Hmmmm... exp run --temp seems to behave more like a remote executor to me. Except that "remote" in this case means "external" (again, like "local remotes" which has never been a great name BTW). But what's the point here? That we can reuse that implementation? If so, sounds good!

Also agree about the assumption and proposed mechanics, except that what if the user hasn't set up any remote storage to use as the intermediate step? Will there be some default ones maintained by us (similar to the default Google Cloud Project)? In any case, a direct transfer may have better performance?

$ dvc exp run --executor my-gpu-tesla8
...
$ dvc exp run --executor tesla8-ram
```

## Queue of executors

A queue of executors (like 4 `p3.large` "hot" instances with a shared volume)
seems like a very common case and should be natively supported. It should
look like a single unit (`my-gpu-tesla-hotpool`) from the pipeline's point of
view.
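
A hypothetical sketch of declaring such a pool (the commands mirror the
`dvc executor add/modify` idea discussed above; all names and options are
invented for illustration):

```bash
$ dvc executor add my-gpu-tesla-hotpool aws
$ dvc executor modify my-gpu-tesla-hotpool instance_type p3.large
$ dvc executor modify my-gpu-tesla-hotpool instances 4        # pool size
$ dvc executor modify my-gpu-tesla-hotpool volume shared-efs  # shared data cache volume
```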

## Queue of experiments

Executing experiments is a slow and expensive operation and might need more
granular management at the experiment level: create, remove, list, etc.
See https://github.com/iterative/dvc/issues/5615
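
Queueing experiments already exists in DVC, so a sketch like the one below
mostly reuses existing flags; only the `--executor` option (and how it
interacts with `--jobs`) would be new:

```bash
$ dvc exp run --queue -S lr=0.001
$ dvc exp run --queue -S lr=0.01
$ dvc exp run --queue -S lr=0.1
$ dvc exp run --run-all --jobs 2 --executor my-gpu-tesla-hotpool
```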

# How We Teach This

We need to introduce the concept of an **Executor** to DVC.
This concept should not break compatibility and should not require
significant changes in the docs (except for command options).
However, it will require a new section in the docs explaining cloud
execution for users who'd like to use this feature.
Comment on lines +153 to +157

If we are to abstract the execution environment, maybe we could start education from the "local" executor - the machine the user is using at the moment. I feel like explaining remotes starting from local ones made it easier for me.

Contributor @jorgeorpinel (Mar 31, 2021):

> start education from "local" executor

Good idea unless that implies significant changes to docs (e.g. having to use the local executor concept everywhere repro/run are mentioned). BTW I don't think we should avoid such big changes at all costs, just not sure it's necessary here. It may make things confusing for people looking for a basic local workflow.


The "local" executor already exists (exp run --temp), but we have already run into potential problems with making this a "default" behavior. Namely that debugging your pipeline becomes very difficult when the changes are not happening in your normal workspace.

So I think we would need to clarify what we want to do regarding educating the user about non-workspace execution environments.

Contributor @jorgeorpinel (Apr 1, 2021):

The "local" executor already exists (exp run --temp)

Good point. Local (to the env) but external (to the project)? Like "local remotes".

UPDATE: Actually I'm not sure. Maybe the local machine itself IS the executor, whether on a regular repro or an exp run --temp. Otherwise what exactly is an executor? 😅

Member Author:

@pared could you please clarify? Do you mean abstracting environments like pip/conda packages or just directory (with --temp as @pmrowla mentioned)?

> Maybe the local machine itself IS the executor

@jorgeorpinel yes, it is the default executor. @pmrowla's point is that we can abstract it out to a separate directory. But this is what exp run --queue does.


@dmpetrov Not quite, I was rather thinking that we are already executing locally. In a way, our local environment is an executor, so what I thought would be good is to teach this concept by saying "you are already using a (local) executor - when using --temp". So, as an analogy: when you use remote executors, you do a similar thing as with --temp and the local one - with all its pros and caveats.


# Drawbacks

First, this feature opens up new user scenarios, which increases the
complexity of DVC; it will require new sections in the docs, etc. However,
it seems like the only way of implementing remote execution if the DVC
experiments table is to be used as a common ledger of ML runs.

It might be beneficial to extract as much of this functionality as possible
into external products and tools (Terraform providers, for example) to
simplify DVC.

Second, more people might confuse DVC with data engineering pipeline
execution tools such as Airflow, Dagster, and others.
We should provide a clear guideline on when DVC should be used for workflow
execution and when it should not.

# Alternatives

The cloud management logic could be implemented in DVC itself, without using
the Iterative TF provider.

# Unresolved questions

TBD