docs & readme: what/why/how #433

Merged
merged 33 commits into from
Mar 15, 2022
Changes from 15 commits
1bcd9fc
first pass README
casperdcl Mar 8, 2022
5368f7e
Restyled by prettier-markdown
restyled-commits Mar 8, 2022
7cfe25b
readme: more restructuring
casperdcl Mar 8, 2022
6992d0e
readme: have a basic script
casperdcl Mar 8, 2022
76e11f6
docs: unify examples
casperdcl Mar 8, 2022
f0b6814
intuitive output
casperdcl Mar 8, 2022
e932990
docs/task: input realtive to script
casperdcl Mar 8, 2022
a4e14d7
minor tweak
casperdcl Mar 8, 2022
b1fa61b
some rewording
casperdcl Mar 8, 2022
2a11af5
docs: rename `workdir.input` => `storage.workdir`
casperdcl Mar 10, 2022
66a5a42
describe machine type better
casperdcl Mar 10, 2022
794304c
docs: update site landing
casperdcl Mar 10, 2022
c17f707
update badges
casperdcl Mar 10, 2022
ddf4910
note on workdir/output relation
casperdcl Mar 10, 2022
fb2ac4f
Restyled by prettier-markdown
restyled-commits Mar 10, 2022
5b18120
misc review updates
casperdcl Mar 10, 2022
40aabcc
readme: shorten contributing
casperdcl Mar 10, 2022
87a76ec
docs: licence badge, tidy installation instructions
casperdcl Mar 10, 2022
97583de
Restyled by prettier-markdown
restyled-commits Mar 10, 2022
6a0ffa2
re-add contrib build details
casperdcl Mar 11, 2022
68fe91d
review suggestions
casperdcl Mar 11, 2022
ac1dfc4
more sync & minification
casperdcl Mar 11, 2022
3f81ab0
docs: note/warning consistency
casperdcl Mar 11, 2022
18f7130
more feedback
casperdcl Mar 11, 2022
c5c1f1b
readme: re-added copyright
casperdcl Mar 11, 2022
d4b3736
remove licence year
casperdcl Mar 11, 2022
3b05aa3
docs: separate authentication guide
casperdcl Mar 11, 2022
4381e65
docs: minify landing further
casperdcl Mar 11, 2022
474ace9
explicit CPU/GPU/RAM
casperdcl Mar 11, 2022
6b91aa3
spotify
casperdcl Mar 14, 2022
9f2a105
re-separate commands, add `disk_size`
casperdcl Mar 14, 2022
aac92bc
update banner
casperdcl Mar 14, 2022
fd848e2
Restyled by prettier-markdown
restyled-commits Mar 14, 2022
98 changes: 60 additions & 38 deletions README.md
@@ -1,48 +1,28 @@
![Terraform Provider Iterative](https://static.iterative.ai/img/cml/banner-terraform.png)
![TPI](https://static.iterative.ai/img/cml/banner-terraform.png)

# Iterative Provider [![](https://img.shields.io/badge/-documentation-5c4ee5?logo=terraform)](https://registry.terraform.io/providers/iterative/iterative/latest/docs)
# Terraform Provider Iterative

The Iterative Provider is a Terraform plugin that enables full lifecycle
management of computing resources for machine learning pipelines, including GPUs, from your favorite cloud vendors.
[![docs](https://img.shields.io/badge/-docs-5c4ee5?logo=terraform)](https://registry.terraform.io/providers/iterative/iterative/latest/docs)
[![tests](https://img.shields.io/github/workflow/status/iterative/terraform-provider-iterative/Test?label=tests&logo=GitHub)](https://github.com/iterative/terraform-provider-iterative/actions/workflows/test.yml)

The Iterative Provider makes it easy to:
- **Orchestrate Resources**: create cloud compute & storage resources without reading pages of documentation
- **Sync & Execute**: move data & run code in the cloud with minimal configuration
- **Low cost**: auto-recovery from spot/preemptible instances to vastly reduce cost
- **No waste**: auto-cleanup unused resources
- **No lock-in**: switch between cloud vendors with ease

- Rapidly move local machine learning experiments to a cloud infrastructure
- Take advantage of training models on spot instances without losing any progress
- Unify configuration of various cloud compute providers
- Automatically destroy unused cloud resources (compute instances are terminated on job completion/failure, and storage is removed when results are downloaded)
Iterative's Provider is a [Terraform](https://terraform.io) plugin built with machine learning pipelines in mind. It enables full lifecycle management of computing resources (including GPUs) from several cloud vendors:

The Iterative Provider can provision resources with the following cloud providers and orchestrators:

- Amazon Web Services
- Amazon Web Services (AWS)
- Microsoft Azure
- Google Cloud Platform
- Kubernetes

## Documentation

See the [Getting Started](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) guide to learn how to use the Iterative Provider. More details on configuring and using the Iterative Provider are in the [documentation](https://registry.terraform.io/providers/iterative/iterative/latest/docs).
- Google Cloud Platform (GCP)
- Kubernetes (K8s)

## Support
With a minimal configuration unified across cloud vendors, the aim is to easily move local experiments to the cloud, transparently resume from interrupted low-cost spot instances, and avoid being charged for unused cloud resources (terminate compute instances upon job completion/failure, and remove storage upon download of results).

Have a feature request or found a bug? Let us know via [GitHub issues](https://github.com/iterative/terraform-provider-iterative/issues). Have questions? Join our [community on Discord](https://discord.gg/bzA6uY7); we'll be happy to help you get started!

## License
## Usage

Iterative Provider is released under the [Apache 2.0 License](https://github.com/iterative/terraform-provider-iterative/blob/master/LICENSE).

## Development

### Install Go 1.17+
Member

I was thinking... what if a user does not have an ML training script on their hands? Or a user has a script, but the environment settings prevent them from executing it? Does it make sense to provide a script in addition to the TF file?

It should lower the barrier to entry quite significantly.

An ideal tutorial should have all the commands, scripts and data I need to run to get a result. Example: the DVC tutorial.

Contributor Author

user does not have any ml training script

I don't follow, do you mean #305?

user has a script but the environment settings will prevent executing

Do you mean the script won't execute locally or won't execute on the cloud? And why won't it execute? Because of missing env vars such as NUM_EPOCHS or something? Or missing dependencies like numpy?

Member

I mean, the user might not have any ML training script at all (or it won't work).
In the DVC tutorial, the user downloads code as the first step. Do we need something similar here?

Contributor Author

@casperdcl casperdcl Mar 12, 2022

ah so an example repo? No we don't have one (yet). Technically we'd probably need 3 - one each for AWS/Azure/GCP.

The current example simply uses AWS and an echo Hello World script, so the user just copy-pastes the main.tf and has no other dependencies.

Member

we'd probably need 3 - one each for AWS/Azure/GCP

😬 Unless we use iterative/example-repos-dev or similar, it doesn't look like the best idea from a maintainability standpoint.

Member

we'd probably need 3 - one each for AWS/Azure/GCP.

I was thinking that TPI is supposed to abstract it out 😄 One code file with one "small" data file that the user can get with wget should be enough.

It would be great to have some "realistic" repo with checkpoints etc... like MNIST but slower :) Fashion-MNIST?

Member

Yes, the only required change is cloud = "whatever" 😅

Contributor Author

@casperdcl casperdcl Mar 14, 2022

I would think at a minimum in example repos:

  • main.tf: cloud = "X"
  • README.md: "How to authenticate/export X_CREDENTIALS, or use [cloud Y](link to example repo for Y) or [cloud Z](link to example repo for Z)"

Plus probably:

  • requirements.txt
  • run.py
  • .github/workflows/cml.yaml

Contributor Author

don't think this complexity is needed in the readme here. Right now it's a self-contained[^1] example that supports users both with and without a script file.

[^1]: apart from the how-to-setup-credentials external link


Refer to the [official documentation](https://golang.org/doc/install) for specific instructions.

### Clone the repository

```console
git clone https://github.com/iterative/terraform-provider-iterative
cd terraform-provider-iterative
```
See the [Getting Started](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) guide for a more detailed guide.

### Install the provider

@@ -61,10 +41,23 @@ terraform {
required_providers { iterative = { source = "iterative/iterative" } }
}
provider "iterative" {}
# ... other resource blocks ...
resource "iterative_task" "example" {
cloud = "aws" # or any of: gcp, az, k8s
machine = "m" # medium, or any of: l, xl, m+k80, xl+v100, ...

storage {
workdir = "."
output = "results"
}
script = <<-END
#!/bin/bash
mkdir results
echo "Hello World!" > results/greeting.txt
END
}
```

**Note:** to use your local build, specify `source = "github.com/iterative/iterative"` (`source = "iterative/iterative"` will download the latest stable release instead).
See the [Documentation](https://registry.terraform.io/providers/iterative/iterative/latest/docs) for obtaining credentials for the chosen `cloud`, and the [Reference](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task) for the full list of options for `main.tf`.

### Initialize the provider

@@ -79,3 +72,32 @@ terraform init --upgrade
```console
terraform apply
```

## Help

Have a feature request or found a bug? Let us know via [GitHub issues](https://github.com/iterative/terraform-provider-iterative/issues). Have questions? Join our [community on Discord](https://discord.gg/bzA6uY7); we'll be happy to help you get started!

## Contributing

Instead of using the latest stable release, a local copy of the repository must be used.

### Install Go 1.17+

Refer to the [official documentation](https://golang.org/doc/install) for specific instructions.

### Clone the repository

```console
git clone https://github.com/iterative/terraform-provider-iterative
cd terraform-provider-iterative
```

### Modify test file

Specify `source = "github.com/iterative/iterative"` to use the local repository.

**Note:** `source = "iterative/iterative"` will download the latest release instead.

## License

[Apache 2.0](https://github.com/iterative/terraform-provider-iterative/blob/master/LICENSE).
26 changes: 13 additions & 13 deletions docs/guides/getting-started.md
@@ -13,25 +13,25 @@ To use the Iterative Provider you will need to:

In the project root directory:

1. Create a directory named `shared` to store input data and output artefacts.
2. Create a file named `main.tf` with the following contents:
Create a file named `main.tf` with the following contents:

```hcl
terraform {
required_providers { iterative = { source = "iterative/iterative" } }
}
provider "iterative" {}
resource "iterative_task" "task" {
resource "iterative_task" "example" {
cloud = "aws" # or any of: gcp, az, k8s
machine = "m"
machine = "m" # medium, or any of: l, xl, m+k80, xl+v100, ...

workdir {
input = "${path.root}/shared"
output = "${path.root}/shared"
storage {
workdir = "."
output = "results"
}
script = <<-END
#!/bin/bash
echo "Hello World!" > greeting.txt
mkdir results
echo "Hello World!" > results/greeting.txt
END
}
```
@@ -45,8 +45,8 @@ The project layout should look similar to this:
```
project/
├── main.tf
└── shared/
└── ...
└── results/
└── greeting.txt (created in the cloud and downloaded locally)
```

## Initializing Terraform
@@ -71,7 +71,7 @@ $ terraform apply
This command will:

1. Create all the required cloud resources.
2. Upload the specified shared `input` working directory to the cloud.
2. Upload the specified working directory (`workdir`) to the cloud.
3. Launch the task `script`.

## Viewing Task Statuses
@@ -93,9 +93,9 @@ $ terraform destroy

This command will:

1. Download the specified shared working directory from the cloud.
1. Download the specified output directory from the cloud.
2. Delete all the cloud resources created by `terraform apply`.
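
The workflow described in this guide can be sketched as a short shell function. This is only an illustration: `run`, `tpi_lifecycle` and `DRY_RUN` are hypothetical helper names, and the `terraform refresh` / `terraform show` status step is an assumption (both are standard Terraform commands, but the guide's exact status command is not visible in this diff).

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN is set, otherwise run it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Sketch of the task lifecycle; call it from the directory containing main.tf.
tpi_lifecycle() {
  # Install or upgrade the provider.
  run terraform init --upgrade || return 1
  # Create cloud resources, upload the working directory, launch the script.
  run terraform apply || return 1
  # Assumption: refresh the state, then print it to inspect the task status.
  run terraform refresh || return 1
  run terraform show || return 1
  # Download the output directory and delete the cloud resources.
  run terraform destroy
}
```

Setting `DRY_RUN=1` before calling `tpi_lifecycle` prints the five commands in order without contacting any cloud, which is a cheap way to review the sequence first.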

## Viewing Task Results

After running `terraform destroy`, the `shared` directory should contain a file named `greeting.txt` with the text `Hello, World!`
After running `terraform destroy`, the `results` directory should contain a file named `greeting.txt` with the text `Hello World!`
33 changes: 28 additions & 5 deletions docs/index.md
@@ -1,6 +1,23 @@
# Iterative Provider

Use the Iterative Provider to launch resource-intensive tasks in popular cloud providers with a single Terraform file.
![TPI](https://static.iterative.ai/img/cml/banner-terraform.png)
Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Member

🔔 @casperdcl


[![tests](https://img.shields.io/github/workflow/status/iterative/terraform-provider-iterative/Test?label=tests&logo=GitHub)](https://github.com/iterative/terraform-provider-iterative/actions/workflows/test.yml)

- **Orchestrate Resources**: create cloud compute & storage resources without reading pages of documentation
- **Sync & Execute**: move data & run code in the cloud with minimal configuration
- **Low cost**: auto-recovery from spot/preemptible instances to vastly reduce cost
- **No waste**: auto-cleanup unused resources
- **No lock-in**: switch between cloud vendors with ease

Iterative's Provider is a [Terraform](https://terraform.io) plugin built with machine learning pipelines in mind. It enables full lifecycle management of computing resources (including GPUs) from several cloud vendors:

- Amazon Web Services (AWS)
- Microsoft Azure
- Google Cloud Platform (GCP)
- Kubernetes (K8s)

With a minimal configuration unified across cloud vendors, the aim is to easily move local experiments to the cloud, transparently resume from interrupted low-cost spot instances, and avoid being charged for unused cloud resources (terminate compute instances upon job completion/failure, and remove storage upon download of results).

## Example Usage

@@ -9,12 +26,18 @@ terraform {
required_providers { iterative = { source = "iterative/iterative" } }
}
provider "iterative" {}
resource "iterative_task" "task" {
cloud = "aws"

resource "iterative_task" "example" {
cloud = "aws" # or any of: gcp, az, k8s
machine = "m" # medium, or any of: l, xl, m+k80, xl+v100, ...

storage {
workdir = "."
output = "results"
}
script = <<-END
#!/bin/bash
echo "hello!"
mkdir results
echo "Hello World!" > results/greeting.txt
END
}
```
37 changes: 24 additions & 13 deletions docs/resources/task.md
@@ -3,25 +3,34 @@
This resource will:

1. Create cloud resources (machines and storage) for the task.
Comment on lines 3 to 5

Contributor

💅🏼 The title of this page is Task Resource but the nav entry is iterative_task. Seems inconsistent

Member

Yes, it should probably be Resource: iterative_task like in other providers.

2. Upload the given `workdir.input` to the cloud storage.
2. Upload the given `storage.workdir` to the cloud storage.
3. Run the given `script` on the cloud machine until completion or `timeout`.
4. Download results to the given `workdir.output`.
4. Download results to the given `storage.output`.

## Example Usage

```hcl
resource "iterative_task" "task" {
cloud = "aws"
resource "iterative_task" "example" {
name = "example"
cloud = "aws"
machine = "m" # medium, or any of: l, xl, m+k80, xl+v100, ...
image = "ubuntu"
region = "us-east"
disk_size = 30 # GB
spot = 0 # auto-price
parallelism = 1
timeout = 3600 # max 1h idle

environment = { GREETING = "Hello, world!" }
workdir {
input = "${path.root}/shared"
output = "${path.root}/results"
storage {
workdir = "."
output = "results"
}
script = <<-END
#!/bin/bash
echo "$GREETING" | tee $(uuidgen)
echo "$GREETING" | tee results/$(uuidgen)
END
# or: script = file("example.sh")
}
```

@@ -30,22 +39,24 @@ resource "iterative_task" "task" {
### Required

- `cloud` - (Required) Cloud provider to run the task on; valid values are `aws`, `gcp`, `az` and `k8s`.
- `script` - (Required) Script to run (relative to `workdir.input`); must begin with a valid [shebang](<https://en.wikipedia.org/wiki/Shebang_(Unix)>). Can use a string, including a [heredoc](https://www.terraform.io/docs/language/expressions/strings.html#heredoc-strings), or the contents of a file returned by the [`file`](https://www.terraform.io/docs/language/functions/file.html) function.
- `script` - (Required) Script to run (relative to `storage.workdir`); must begin with a valid [shebang](<https://en.wikipedia.org/wiki/Shebang_(Unix)>). Can use a string, including a [heredoc](https://www.terraform.io/docs/language/expressions/strings.html#heredoc-strings), or the contents of a file returned by the [`file`](https://www.terraform.io/docs/language/functions/file.html) function.
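
As a rough pre-flight check for the shebang requirement, a script file can be inspected locally before it is passed to `file(...)`. The helper name `has_shebang` is hypothetical and not part of the provider; the sketch only checks the `#!` prefix, not whether the named interpreter exists on the cloud machine.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the first line of file $1 starts with "#!".
has_shebang() {
  first_line=
  # `read` returns non-zero for a final line without a newline, so also accept
  # the case where it still captured some text.
  IFS= read -r first_line < "$1" || [ -n "$first_line" ] || return 1
  case $first_line in
    '#!'*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `has_shebang example.sh || echo "example.sh needs a shebang line"` warns before `terraform apply` is run.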

### Optional

- `name` - (Optional) Deterministic task name.
- `region` - (Optional) [Cloud region/zone](#cloud-regions) to run the task on.
- `machine` - (Optional) See [Machine Types](#machine-types) below.
Contributor

Why is Generic Machine Types under "Development" in the nav? (What does Development even refer to here?) Feels like machine types should be under the same section as the iterative_task res ref. somehow.

Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Nav structure idea:

TPI

  • GS
  • Guides
    • Auth
    • Azure K8s
  • Ref
    • Task Res
    • Machine types

Member

As long as abbreviations aren't part of the proposal, sounds good. The only thing I find rather dissonant is moving the Machine Types page under the Reference section.

Contributor

Abbreviations aren't part of the proposal; I was just being lazy.

Contributor

@jorgeorpinel jorgeorpinel Mar 29, 2022

I find rather dissonant is moving the Machine Types page under the Ref

What does "Development" refer to as-is now? Maybe I'm not getting the current intended struct.

Member

An XY solution would be disguising “Machine types“ as a guide (e.g. how to choose machine types) 🤔

Contributor

Yeah either put it in the guide or in the ref IMO 😬

Member

The issue is that the “Resources” section isn't meant to be a “Reference” in the general sense of the word; it's just a list of resources (hundreds in some providers).

Member

We can move it under “Resources”, but it's rather unorthodox. 🙃

Contributor

@jorgeorpinel jorgeorpinel Mar 30, 2022

Yeah that's fine, call the section Resources (just also use that word in links if possible, instead of "reference"), and put machine types in the guide.

- `disk_size` - (Optional) Size of the ephemeral machine storage.
- `disk_size` - (Optional) Size of the ephemeral machine storage in GB.
- `spot` - (Optional) Spot instance price. `-1`: disabled, `0`: automatic price, any other positive number: fixed price.
- `image` - (Optional) [Machine image](#machine-images) to run the task with.
- `parallelism` - (Optional) Number of machines to be launched in parallel.
- `workdir.input` - (Optional) Local working directory to upload.
- `workdir.output` - (Optional) Local directory to download results to (default: no download).
- `storage.workdir` - (Optional) Local working directory to upload and use as the `script` working directory.
- `storage.output` - (Optional) Results directory (**relative to `workdir`**) to download (default: no download).
- `environment` - (Optional) Map of environment variable names and values for the task script. Empty string values are replaced with local environment values. Empty values may also be combined with a [glob](<https://en.wikipedia.org/wiki/Glob_(programming)>) name to import all matching variables.
- `timeout` - (Optional) Maximum number of seconds to run before termination.
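
The `environment` rules above (explicit values are used as given; empty values are filled from the local environment, optionally via a glob name) can be modelled with a small shell function. This is a sketch of the documented behaviour only, not the provider's implementation; `resolve_env` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical model of how one `environment` entry is resolved.
#   resolve_env NAME VALUE
# - non-empty VALUE: emitted as NAME=VALUE
# - empty VALUE: NAME is treated as a glob and matched against local variables
resolve_env() {
  name=$1
  value=$2
  if [ -n "$value" ]; then
    printf '%s=%s\n' "$name" "$value"
    return
  fi
  # Caveat: reading `env` line by line mishandles values that contain newlines.
  env | while IFS= read -r line; do
    key=${line%%=*}
    case $key in
      # Unquoted on purpose, so that "TF_*" style names act as patterns.
      $name) printf '%s\n' "$line" ;;
    esac
  done
}
```

With this model, `resolve_env GREETING 'Hello, world!'` yields the literal pair, while `resolve_env 'AWS_*' ''` lists every exported local variable whose name starts with `AWS_`.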

~> **Note:** `output` is relative to `workdir`, so `storage { workdir = "foo", output = "bar" }` means "upload `./foo/`, change working directory to the uploaded folder, run `script`, and download `bar` (i.e. `./foo/bar`)".
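
To make the relationship concrete, the upload/run/download cycle in the note can be imitated locally with temporary directories. Nothing here talks to a cloud: `simulate_task` is a hypothetical helper, the temporary directory merely stands in for the remote machine, and the script is run with `sh` for portability rather than via its shebang.

```shell
#!/bin/sh
# Local stand-in for the task cycle; no cloud resources are involved.
#   simulate_task WORKDIR OUTPUT SCRIPT_FILE
# The body runs in a subshell so its variables and `cd` do not leak out.
simulate_task() (
  workdir=$1
  output=$2
  # Resolve the script to an absolute path before changing directory.
  script=$(cd "$(dirname "$3")" && pwd)/$(basename "$3")
  remote=$(mktemp -d)   # plays the role of the uploaded copy

  # "Upload": copy the contents of workdir.
  cp -R "$workdir/." "$remote/"
  # Run the script with the uploaded copy as its working directory.
  (cd "$remote" && sh "$script")
  # "Download": only the output directory (relative to workdir) comes back.
  if [ -d "$remote/$output" ]; then
    mkdir -p "$workdir/$output"
    cp -R "$remote/$output/." "$workdir/$output/"
  fi
  rm -rf "$remote"
)
```

Calling `simulate_task foo bar run.sh` leaves results in `./foo/bar`, while any other files the script wrote next to `bar` are discarded along with the temporary copy.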

## Attribute Reference

In addition to all arguments above, the following attributes are exported:
@@ -209,7 +220,7 @@ Setting the `region` attribute results in undefined behaviour.

#### Directory storage

Unlike public cloud providers, Kubernetes does not offer any portable way of persisting and sharing storage between pods. When specified, the `workdir.input` attribute will create a `PersistentVolumeClaim` of the default `StorageClass`, with the same lifecycle as the task and the specified `disk_size`.
Unlike public cloud providers, Kubernetes does not offer any portable way of persisting and sharing storage between pods. When specified, the `storage.workdir` attribute will create a `PersistentVolumeClaim` of the default `StorageClass`, with the same lifecycle as the task and the specified `disk_size`.

~> **Warning:** Access mode will be `ReadWriteOnce` if `parallelism=1` or `ReadWriteMany` otherwise.
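
The rule in the warning can be written as a one-line decision. `pvc_access_mode` is a hypothetical helper used purely to restate the rule; it does not query or configure a cluster.

```shell
#!/bin/sh
# Hypothetical helper: map a `parallelism` value to the access mode named
# in the warning above.
pvc_access_mode() {
  if [ "$1" -eq 1 ]; then
    echo ReadWriteOnce
  else
    echo ReadWriteMany
  fi
}
```

Since `ReadWriteMany` is not supported by every `StorageClass`, running `pvc_access_mode 4` before raising `parallelism` is a reminder to check that the cluster's default class can provide it.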
