Cloud execution #3
# Cloud execution

- Enhancement Proposal PR: (leave this empty)
- Contributors: dmpetrov

# Summary
Users need to run ML experiments on cloud instances or remote machines when
they do not have enough resources on the local machine or when a bigger scale
is required for ML training (e.g. 5 GPU instances running different
experiments).

Cloud execution should be supported at the DVC experiments level, so that all
training results are presented in a single experiments table.

# Motivation
Today users need to spend time instrumenting remote execution when local
resources for ML training (like GPU or RAM) are lacking or when it is company
policy to train on dedicated hardware or on a Kubernetes cluster.
The existing way of executing training through [CML](https://cml.dev) (CI/CD)
works great for handing off ML models between team members, but it is not the
best approach in the experimentation phase: CI/CD adds overhead to execution
time and leaves footprints/records in CI/CD that become a bit overwhelming when
the number of experiments is huge (10s, 100s, or 1000s).
This feature can save time for users and improve the ML experimentation
experience.

## Scenarios
1. **Run exp.** Allocate a cloud instance (with a specified configuration) and
   execute an ML experiment `dvc exp run` (or `dvc repro`) on it.
2. **Run stage.** Allocate a cloud instance and execute a particular stage of
   an experiment on it.
3. **Hot instance.** Execute on an existing running instance without instance
   allocation.
4. **Remote machine.** Execute on a remote machine (cloud or on-premise).
5. **Instance pool.** Execute on one of the existing running instances.
6. **Exp queue.** Execute a set of experiments (from the experiments queue, for
   example) on an instance pool or a single instance.
7. **Web run.** Execute an experiment from SaaS/web using one of the methods
   above and DVC machinery.
**Comment on lines +30 to +41:**

Are there different categories of scenarios? Some of these overlap. For
example, a web run could apply to any of the other scenarios. To support all of
these, it seems like these specs would be needed:

- Compute resource spec (provisioning: new vs. existing)
- DVC execution spec (stages: single stage vs. full pipeline)

Does that seem accurate? And is that a useful way to organize it?

@dberenbaum I think the scenarios can definitely be grouped. This is probably a
non-exhaustive list of use cases. I like the categories you summarized but
wouldn't see them as specs (if you meant specs to implement the feature). The
user will probably care about these distinctions, but they may not be relevant
for the tool, except perhaps whether running on a single instance or several
(IF this includes parallel execution). This is great for how to teach this,
though!

@dberenbaum some regrouping might be needed. But in some cases such grouping
might not be relevant, as @jorgeorpinel mentioned. An example: an on-premise
run is a prerequisite of cloud. I'd keep the whole list for now and try to cut
the scope when we start implementing this.

No problem, this just helps me organize it better in my head.

## Optimizations
Multiple cloud instance optimization techniques might be used to optimize
resource usage or execution time:

1. **Spot instances.**
2. **Transparent spot instances.** Recover the execution if a spot instance was
   terminated. DVC checkpoints and pipeline stages should be used for
   preserving the state.
3. **Volumes.** Volumes can be attached and reused across instances to minimize
   data cache synchronization time.
4. **Shared volumes.** Separate cloud services (such as Multi-Attach EBS or
   EFS) might be needed for sharing the data cache between multiple instances.
**Comment on lines +47 to +54:**

Should we separate compute resource types (1 and 2) and storage resource types
(3 and 4)?

Sure. But for now the list of optimizations is too small to warrant separation
or regrouping. Also, some optimizations might be related, like re-attaching a
volume when a spot instance fails.

## Monitoring
Additional features might be needed to improve the user experience:

1. **State monitoring.** A user should see whether an instance was already
   allocated and which stage is running.
2. **Metrics monitoring.** Monitoring of the latest metrics (with some
   reasonable delay, e.g. 10/30/60 seconds); see the sketch below.
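
A minimal sketch of what metrics monitoring could look like from the command
line, assuming the remote run keeps the experiments table up to date. Only the
existing `dvc exp show` command is used here; a dedicated streaming or
"follow" mode would be new functionality and is not part of this proposal.

```bash
# Poll the experiments table every 30 seconds while a remote run is in
# progress. The refresh interval is arbitrary.
watch -n 30 dvc exp show
```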

## Resource orchestrators
1. AWS
2. Azure
3. GCP (optional)
4. Remote over SSH
5. Kubernetes (K8S)
6. HashiCorp Nomad (optional)
7. Container services (ECS and Azure/GCP analogs)
**Comment on lines +67 to +73:**

Seems like there are a few different categories here also:

I list them in priority order from the user's point of view. Do you think the
categories will help? There might be many ways of categorizing. From an
implementation point of view:

These categories are all helpful for me, especially since these lists have
items of different types, which confuses me if I don't break them down. Again,
not important as long as all the scenarios are clear to everyone.

# Detailed design

## Iterative Terraform provider
The cloud management part seems like a heavy part of the project (from the
implementation point of view). However, the problem of executing and managing
cloud instances was already solved in the CML project.

CML separates the user API part (commands) from the cloud instance management
code. The cloud management is extracted into a separate **Iterative Terraform
(TF) provider**:
https://github.com/iterative/terraform-provider-iterative

The Iterative provider is aimed at data scientists, not DevOps (TF's target
audience). It has a somewhat simplified API and some ML-specific functionality
(some of it still coming), such as recovering a spot instance if it was deleted
or attaching a reusable drive.

The Iterative TF provider can be used from DVC for cloud instance management.
Some Python libraries might be required to access the GoLang-specific Terraform
provider.
Today the provider supports:

- AWS, Azure, and K8S orchestrators
- spot instance and volume optimizations

A few of the initial scenarios can already be implemented with this
functionality, and more is coming. A rough sketch of how DVC could drive the
provider is shown below.
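
For illustration only, a minimal sketch of driving the provider from a shell.
The `iterative_machine` resource and its attributes follow the provider's
early documentation; treat the exact schema (attribute names, region and
instance values) as assumptions that may differ between provider versions.

```bash
# Hypothetical sketch: generate a Terraform config for a single cloud instance
# and apply it. Attribute names are assumptions based on the
# terraform-provider-iterative README at the time of writing.
cat > main.tf <<'EOF'
terraform {
  required_providers {
    iterative = { source = "iterative/iterative" }
  }
}

provider "iterative" {}

resource "iterative_machine" "my-gpu-tesla" {
  cloud         = "aws"
  region        = "us-west"
  instance_type = "p3.2xlarge" # executor definition, decoupled from the pipeline
  spot          = true         # optimization: spot instances
}
EOF

terraform init && terraform apply -auto-approve   # allocate the instance
# ... run `dvc exp run` on the allocated machine (e.g. over SSH) ...
terraform destroy -auto-approve                   # terminate when done
```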

## Instances configuration
Instance configuration is a definition of an instance and its behavior:

- orchestrator: AWS, GCP, Azure, remote, K8S, ...
- instance type
- volume
- many other options

Instance configuration is well developed in the CML project and can be reused.
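
For reference, this is roughly what such a configuration looks like when
provisioning a runner with CML today. The flag names come from CML's runner
documentation at the time of writing and may have evolved; the specific values
(region, instance type, labels) are placeholders, not recommendations.

```bash
# CML's runner command already bundles an instance configuration:
# orchestrator (cloud), region, instance type, spot usage, etc.
cml-runner \
  --cloud=aws \
  --cloud-region=us-west \
  --cloud-type=p3.2xlarge \
  --cloud-spot \
  --labels=cml-gpu
```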

## Executors and naming

Ideally, a user should be able to define or overwrite executors in config
(`dvc.yaml` or `.dvc/config`). `dvc exp run` should execute the experiment,
provisioning/terminating instances if needed.

The executor definition (a `p3.large` AWS instance in `us-west` with the
`abcdef` volume attached) should be decoupled from the pipeline definition the
same way data remotes are: stage `train` runs on `my-gpu-tesla`, not on an
executor definition with `p3.large`.
**Comment on lines +121 to +124:**

If we define in ...
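
To make the decoupling concrete, a minimal sketch of how a stage could refer to
an executor by name. The `executor` field is purely hypothetical (this proposal
does not settle on a syntax); only the stage/command shape follows the existing
`dvc.yaml` format.

```bash
# Hypothetical dvc.yaml fragment: the stage refers to the executor by name
# only; the instance definition (cloud, type, volume) lives elsewhere, the
# same way a data remote is referenced by name.
cat dvc.yaml
# stages:
#   train:
#     cmd: python train.py
#     executor: my-gpu-tesla   # hypothetical field, not "p3.large in us-west"
```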

Many instance types and even clouds could be used in a single DVC project. We
need a way of configuring **executors** and reusing them the same way a user
can configure a data remote. It can look like this:

```bash
$ dvc exp run --executor my-gpu-tesla
...
$ dvc exp run --executor my-gpu-tesla8
...
$ dvc exp run --executor tesla8-ram
```

**Comment on lines +129 to +132:**

Would executors get set up like remotes?

```bash
$ dvc executor add my-gpu-tesla ssh://myuser@example.com
$ dvc executor modify my-gpu-tesla pword 'myp4$$'
$ dvc exp run --on my-gpu-tesla
```

Are resulting outputs synchronized back into the workspace? Or does the
executor keep its own repo clone with separate state?

My assumption here would be that the results of an experiment are synced back
into the workspace ("results" meaning git-tracked repo state). Current ...
For DVC-tracked (cached) outputs, they would be fetched into the local cache.
This could be done by either fetching the data directly from the executor
machine, or by using an intermediate DVC remote (so after a run, the executor
does ...). Using the intermediate remote seems like it would fit better to me,
but I think that needs some more discussion.

Hmmmm... Also agree about the assumption and proposed mechanics, except: what
if the user hasn't set up any remote storage to use as the intermediate step?
Will there be some default ones maintained by us (similar to the default
Google Cloud Project)? In any case, a direct transfer may have better
performance?
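
To push the remote analogy a bit further, a hypothetical sketch of how an
executor definition could be persisted in `.dvc/config` and then referenced by
name. The `executor` section and its fields do not exist in DVC today; the
values reuse the `p3.large`/`us-west`/`abcdef` example from above.

```bash
# Hypothetical: an executor defined once (mirroring how `dvc remote add` /
# `dvc remote modify` persist data remotes) and referenced by name at run time.
# The section and field names below are assumptions, not an existing format.
cat .dvc/config
# [executor "my-gpu-tesla"]
#     orchestrator = aws
#     region = us-west
#     instance_type = p3.large
#     volume = abcdef

dvc exp run --executor my-gpu-tesla
```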

## Queue of executors

A queue of executors (like 4 `p3.large` "hot" instances with a shared volume)
seems like a very common case and should be natively supported. It should look
like a single unit (`my-gpu-tesla-hotpool`) from the pipeline's point of view.

## Queue of experiments

Executing experiments is a slow and expensive operation and might need more
granular management at the experiment level: create, remove, list, etc.
See https://github.com/iterative/dvc/issues/5615.
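
For illustration, this is roughly how the two queue concepts could combine on
the command line. `--queue`, `--run-all`, `--jobs`, and `--set-param`/`-S`
already exist for the local experiments queue in DVC; the `--executor` option
and the pool name are assumptions taken from this proposal.

```bash
# Queue a few experiments locally (existing DVC functionality).
dvc exp run --queue -S train.lr=0.01
dvc exp run --queue -S train.lr=0.001

# Hypothetical: run the whole queue on a pool of hot instances, several at a
# time, instead of on the local machine.
dvc exp run --run-all --jobs 4 --executor my-gpu-tesla-hotpool
```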

# How We Teach This

We need to introduce the concept of an **executor** to DVC.
This concept should not break compatibility and should not require significant
changes in the docs (except command options).
However, it will require a new section in the docs to explain cloud execution
for users who'd like to use this feature.
**Comment on lines +153 to +157:**

If we are to abstract the execution environment, maybe we could start education
from the "local" executor, i.e. the machine the user is using at the moment. I
feel like explaining remotes starting from local ones made it easier for me.

Good idea, unless that implies significant changes to docs (e.g. having to use
the local executor concept everywhere ...).

The "local" executor already exists (...). So I think we would need to clarify
what we want to do regarding educating the user about non-workspace execution
environments.

Good point. Local (to the env) but external (to the project)? Like "local
remotes". UPDATE: Actually I'm not sure. Maybe the local machine itself IS the
executor, whether on a regular ...

@pared could you please clarify? Do you mean abstracting environments like
pip/conda packages or just a directory (with ...)?

@jorgeorpinel yes, it is the default executor. @pmrowla's point is that we can
abstract it out to a separate directory. But this is what ...

@dmpetrov Not quite, I was rather thinking that we are already executing
locally. In a way, our local environment is an executor, so what I thought
would be good is to teach this concept by saying "you already are using a
(local) executor when using ..."

# Drawbacks

First, this feature opens new user scenarios which increase the complexity of
DVC, and it will require new sections in the docs, etc. However, it seems like
the only way of implementing remote execution if the DVC experiments table is
used as a common ledger of ML runs.

It might be beneficial to extract as much of this functionality as possible
into external products and tools (Terraform providers, for example) to simplify
DVC.

Second, more people might confuse DVC with data engineering pipeline execution
tools such as Airflow, Dagster, and others. We should provide a clear guideline
on when DVC should be used for workflow execution and when it should not.

# Alternatives

The cloud management logic can be implemented in DVC without using the
Iterative TF provider.

# Unresolved questions

TBD

**Comment:**

To what extent would DVC utilize pools? Are we thinking of parallel execution /
distributed computing? Or focusing for now on single instances (which may be
obtained from a pool)?

Single instance or a pool of instances for parallel execution. But it is not
about distributed learning (if I understood the question correctly).