Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs & readme: what/why/how #433

Merged
merged 33 commits into from
Mar 15, 2022
Merged

docs & readme: what/why/how #433

merged 33 commits into from
Mar 15, 2022

Conversation

casperdcl
Copy link
Contributor

@casperdcl casperdcl commented Mar 8, 2022

fixes #415

@casperdcl casperdcl added the documentation Markdown files label Mar 8, 2022
@casperdcl casperdcl self-assigned this Mar 8, 2022
@casperdcl casperdcl marked this pull request as ready for review March 8, 2022 17:47
@casperdcl casperdcl mentioned this pull request Mar 8, 2022
8 tasks
@casperdcl casperdcl requested review from a team and dmpetrov March 8, 2022 19:56
Copy link
Member

@0x2b3bfa0 0x2b3bfa0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some accessory comments. Looks great!

docs/resources/task.md Outdated Show resolved Hide resolved
docs/guides/getting-started.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@DavidGOrtega
Copy link
Contributor

@casperdcl Looking good!

Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change. Please take a look at my feedback online.

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more comments/ideas.

README.md Show resolved Hide resolved
README.md Outdated
```console
make install
```
- `terraform apply`: launch cloud instance(s), upload `workdir`, and run `script`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be confusing a bit. Should user upload workdir after terraform apply? This how I read this for the 1st time.

I'd suggest keeping one line per action/command as we have above. Like:

Launch an training job

terraform apply

This command creates an instance in a cloud based on main.tf specification. After the instance is running it uploads the current directory (workdir) and data directory (we don't have this one, right? :) ) to the instance and executes your training script (script).

In case of spot instances, the cloud will run recovery logic using the auto-scaling group.

Once the script is finishes (or fails) the instance will be terminated automatically. Results and logs are periodicaly syncing to cloud storage.

Like this....

Copy link
Contributor Author

@casperdcl casperdcl Mar 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well I was trying to avoid a tqdm-style Readme but sure can do 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should work e2e and show the use case. Otherwise, we are increasing time-to-value quite a lot. One of the most important metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better now?


## Development

### Install Go 1.17+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking... what if a user does not have any ml training script on heirs hands? Or a user has a script but the environment settings will prevent user from executing it? Does it make sense to provide a script in additional to the TF file?

It should reduce the entrance bar quite significantly.

An ideal tutorial should have all the commands, scripts and data I need to run to get a result. Example - dvc tutorial.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user does not have any ml training script

I don't follow, do you mean #305?

user has a script but the environment settings will prevent executing

Do you mean the script won't execute locally or won't execute on the cloud? And why won't it execute? Because of missing env vars such as NUM_EPOCHS or something? Or missing dependencies like numpy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, user might not have any ml training script at all (or it won't work).
In DVC tutorial, user downloads code as the first step. Do we need something similar here?

Copy link
Contributor Author

@casperdcl casperdcl Mar 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah so an example repo? No we don't have one (yet). Technically we'd probably need 3 - one each for AWS/Azure/GCP.

The current example simply uses AWS and an echo Hello World script, so the user just copy-pastes the main.tf and has no other dependencies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd probably need 3 - one each for AWS/Azure/GCP

😬 Unless we use iterative/example-repos-dev or similar, it doesn't look like the best idea from a maintainability standpoint.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd probably need 3 - one each for AWS/Azure/GCP.

I was thinking that TPI is suppose abstract it out 😄 One code file with one "small" data file that user can get with wget should be enough.

It would be great to have some "realistic" repo with checkpoint etc... like minst but slower :) Fashion-mnist?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the only required change is cloud = "whatever" 😅

Copy link
Contributor Author

@casperdcl casperdcl Mar 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think at a minimum in example repos:

  • main.tf: cloud = "X"
  • README.md: "How to authenticate/export X_CREDENTIALS, or use [cloud Y](link to example repo for Y) or [cloud Z](link to example repo for Z)"

Plus probably:

  • requirements.txt
  • run.py
  • .github/workflows/cml.yaml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think this complexity is needed in the readme here. Right now it's a self-contained1 example that supports users both with and without a script file.

Footnotes

  1. apart from the how-to-setup-credentials external link

Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple of comments :)


## Development

### Install Go 1.17+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd probably need 3 - one each for AWS/Azure/GCP.

I was thinking that TPI is suppose abstract it out 😄 One code file with one "small" data file that user can get with wget should be enough.

It would be great to have some "realistic" repo with checkpoint etc... like minst but slower :) Fashion-mnist?

README.md Outdated
```console
make install
```
- `terraform apply`: launch cloud instance(s), upload `workdir`, and run `script`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should work e2e and show the use case. Otherwise, we are increasing time-to-value quite a lot. One of the most important metrics.

@arcticbear
Copy link

Update from the discussion in Slack here. Agreed to use the universal TPI banner with the dark background.

terraform-iterative-dark-bg-3

@casperdcl
Copy link
Contributor Author

I'd suggest merging for now @dmpetrov @0x2b3bfa0 since it's probably best to iterate on docs in follow-up issues/PRs.

@casperdcl casperdcl merged commit fcc25f3 into master Mar 15, 2022
@casperdcl casperdcl deleted the highlights branch March 15, 2022 14:49
@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Mar 15, 2022

Sounds good, but please let's keep track of the unresolved discussions. There are several good points and ideas we'll probably want to consider in the short/mid term.

@casperdcl casperdcl mentioned this pull request Mar 15, 2022
21 tasks
@casperdcl
Copy link
Contributor Author

naturally :)

I think it's just the "end-to-end" point in #363, but let me know if anything else is unresolved.


Use the Iterative Provider to launch resource-intensive tasks in popular cloud providers with a single Terraform file.
![TPI](https://static.iterative.ai/img/cml/banner-terraform.png)
Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔔 @casperdcl


# Iterative Provider [![](https://img.shields.io/badge/-documentation-5c4ee5?logo=terraform)](https://registry.terraform.io/providers/iterative/iterative/latest/docs)
# Terraform Provider Iterative (TPI)
Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's beyond scope but is a preposition missing from the tool's expanded name? I.g. Terraform Provider for Iterative (Tools)?

"Iterative's Terraform Provider" would also be more readable IMO but at this point changing the TLA is prob. out of the question.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also on the tool's name: The confusion is reinforced by the official docs site calling it just "iterative provider" (top of nav).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, TPI was named after the repository name, but (cough) doesn't make sense to humans.


The Iterative Provider makes it easy to:
TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. Full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.
Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we be more emphatic on why the machine learning focus is important? I guess it's not a common integration or offering out there in the world of TF providers.

Also should we mention it's designed specifically (though not exclusively I think) for certain Iterative tools? Connecting this with CML early could make sense for example. Rn it feels from this description like it's mainly a stand-alone tool for DYI ML?Ops.

Comment on lines +25 to +27
- [Getting Started](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started)
- [Authentication](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication)
- [Full reference](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task)
Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These links are in a special arrangement different from the left-side content structure. Causes confusion as to what's more important info. "Full reference" is not even on the left side (at least with that title).

image
vs
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s. also these links open in a separate window unexpectedly.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Links: an idea is to mention/link the Contributing info. in the repo instead in here.

Comment on lines +26 to +30
### Requirements

## Support
- [Install Terraform 1.0+](https://learn.hashicorp.com/tutorials/terraform/install-cli#install-terraform), e.g.:
- Brew (Homebrew/Mac OS): `brew tap hashicorp/tap && brew install hashicorp/tap/terraform`
- Choco (Chocolatey/Windows): `choco install terraform`
Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I see from the README that the Usage section (Get Started page in docs) comes first. So Authentication should be under GS in the nav, I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usage section (Get Started page in docs

Call both Usage or both Get Started, BTW? I'd incline for the latter based on how the content is presented.


Have a feature request or found a bug? Let us know via [GitHub issues](https://github.com/iterative/terraform-provider-iterative/issues). Have questions? Join our [community on Discord](https://discord.gg/bzA6uY7); we'll be happy to help you get started!
### Define a Task
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💅🏼 Docs should also read "Define" I think. Currently the GS reads "Defining".

sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install terraform
```
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][authentication]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm looks like all internal links (except nav) open in a separate window. Idk, feels strange to me.

```
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][authentication]

[authentication]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication
Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would relative links work? E.g. /providers/iterative/iterative/latest/docs/guides/authentication or even just (../)authentication above.

Comment on lines +103 to +105
## Help

Run this command after every `make install` to use the new build:
The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[The GS] has some more information.

Does it? They look vey similar at first glance. Why refer to almost the same doc but in another site? Hard to find meaningful differences

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could instead link from relevant places in the README to the other docs, for now Azure Kubernetes Service and Generic Machine Types.

@@ -30,22 +39,24 @@ resource "iterative_task" "task" {
### Required

- `cloud` - (Required) Cloud provider to run the task on; valid values are `aws`, `gcp`, `az` and `k8s`.
- `script` - (Required) Script to run (relative to `workdir.input`); must begin with a valid [shebang](<https://en.wikipedia.org/wiki/Shebang_(Unix)>). Can use a string, including a [heredoc](https://www.terraform.io/docs/language/expressions/strings.html#heredoc-strings), or the contents of a file returned by the [`file`](https://www.terraform.io/docs/language/functions/file.html) function.
- `script` - (Required) Script to run (relative to `storage.workdir`); must begin with a valid [shebang](<https://en.wikipedia.org/wiki/Shebang_(Unix)>). Can use a string, including a [heredoc](https://www.terraform.io/docs/language/expressions/strings.html#heredoc-strings), or the contents of a file returned by the [`file`](https://www.terraform.io/docs/language/functions/file.html) function.

### Optional

- `name` - (Optional) Deterministic task name.
- `region` - (Optional) [Cloud region/zone](#cloud-regions) to run the task on.
- `machine` - (Optional) See [Machine Types](#machine-types) below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Generic Machine Types under "Development" in the nav? (What does Development even refer to here?) Feels like machine types should be under the same section as the iterative_task res ref. somehow.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nav structure idea:

TPI

  • GS
  • Guides
    • Auth
    • Azure K8s
  • Ref
    • Task Res
    • Machine types

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as abbreviations aren't part of the proposal, sounds good. The only thing I find rather dissonant is moving the Machine Types page under the Reference section.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abbreviations aren't part of the proposal; I was just being lazy.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find rather dissonant is moving the Machine Types page under the Ref

What does "Development" refer to as-is now? Maybe I'm not getting the current intended struct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An XY solution would be disguising “Machine types“ as a guide (e.g. how to choose machine types) 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah either put it in the guide or in the ref IMO 😬

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that the “Resources” section wat meant isn't a “Reference” in the general sense of the word; it's just a list of resources (hundreds in some providers).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move is under “Resources”, but it's rather unorthodox. 🙃

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fine, call the section Resources (just also use that word in links if possible, instead of "reference") And put machine types in the guide.

Comment on lines 3 to 5
This resource will:

1. Create cloud resources (machines and storage) for the task.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💅🏼 The title of this page is Task Resource but the nav entry is iterative_task. Seems inconsistent

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should probably be Resource: iterative_task like in other providers.

@0x2b3bfa0
Copy link
Member

@jorgeorpinel, even if your review is a bit late (post–merge), you make lots of good points that should be considered and addressed. 😅

@0x2b3bfa0
Copy link
Member

@casperdcl, should this be discussed here and then applied with a separate pull request?

@casperdcl casperdcl mentioned this pull request Mar 24, 2022
17 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Markdown files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

basic docs improvements
7 participants