-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs & readme: what/why/how #433
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some accessory comments. Looks great!
@casperdcl Looking good! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great change. Please take a look at my feedback online.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more comments/ideas.
README.md
Outdated
```console | ||
make install | ||
``` | ||
- `terraform apply`: launch cloud instance(s), upload `workdir`, and run `script` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be confusing a bit. Should user upload workdir
after terraform apply
? This how I read this for the 1st time.
I'd suggest keeping one line per action/command as we have above. Like:
Launch an training job
terraform apply
This command creates an instance in a cloud based on main.tf
specification. After the instance is running it uploads the current directory (workdir
) and data directory (we don't have this one, right? :) ) to the instance and executes your training script (script
).
In case of spot instances, the cloud will run recovery logic using the auto-scaling group.
Once the script is finishes (or fails) the instance will be terminated automatically. Results and logs are periodicaly syncing to cloud storage.
Like this....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well I was trying to avoid a tqdm-style Readme but sure can do 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should work e2e and show the use case. Otherwise, we are increasing time-to-value quite a lot. One of the most important metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better now?
|
||
## Development | ||
|
||
### Install Go 1.17+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking... what if a user does not have any ml training script on heirs hands? Or a user has a script but the environment settings will prevent user from executing it? Does it make sense to provide a script in additional to the TF file?
It should reduce the entrance bar quite significantly.
An ideal tutorial should have all the commands, scripts and data I need to run to get a result. Example - dvc tutorial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
user does not have any ml training script
I don't follow, do you mean #305?
user has a script but the environment settings will prevent executing
Do you mean the script won't execute locally or won't execute on the cloud? And why won't it execute? Because of missing env vars such as NUM_EPOCHS
or something? Or missing dependencies like numpy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, user might not have any ml training script at all (or it won't work).
In DVC tutorial, user downloads code as the first step. Do we need something similar here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah so an example repo? No we don't have one (yet). Technically we'd probably need 3 - one each for AWS/Azure/GCP.
The current example simply uses AWS and an echo Hello World
script, so the user just copy-pastes the main.tf
and has no other dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'd probably need 3 - one each for AWS/Azure/GCP
😬 Unless we use iterative/example-repos-dev or similar, it doesn't look like the best idea from a maintainability standpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'd probably need 3 - one each for AWS/Azure/GCP.
I was thinking that TPI is suppose abstract it out 😄 One code file with one "small" data file that user can get with wget
should be enough.
It would be great to have some "realistic" repo with checkpoint etc... like minst but slower :) Fashion-mnist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the only required change is cloud = "whatever"
😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think at a minimum in example repos:
main.tf
:cloud = "X"
README.md
: "How to authenticate/export X_CREDENTIALS
, or use[cloud Y](link to example repo for Y)
or[cloud Z](link to example repo for Z)
"
Plus probably:
requirements.txt
run.py
.github/workflows/cml.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple of comments :)
|
||
## Development | ||
|
||
### Install Go 1.17+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'd probably need 3 - one each for AWS/Azure/GCP.
I was thinking that TPI is suppose abstract it out 😄 One code file with one "small" data file that user can get with wget
should be enough.
It would be great to have some "realistic" repo with checkpoint etc... like minst but slower :) Fashion-mnist?
README.md
Outdated
```console | ||
make install | ||
``` | ||
- `terraform apply`: launch cloud instance(s), upload `workdir`, and run `script` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should work e2e and show the use case. Otherwise, we are increasing time-to-value quite a lot. One of the most important metrics.
Update from the discussion in Slack here. Agreed to use the universal TPI banner with the dark background. |
I'd suggest merging for now @dmpetrov @0x2b3bfa0 since it's probably best to iterate on docs in follow-up issues/PRs. |
Sounds good, but please let's keep track of the unresolved discussions. There are several good points and ideas we'll probably want to consider in the short/mid term. |
naturally :) I think it's just the "end-to-end" point in #363, but let me know if anything else is unresolved. |
|
||
Use the Iterative Provider to launch resource-intensive tasks in popular cloud providers with a single Terraform file. | ||
![TPI](https://static.iterative.ai/img/cml/banner-terraform.png) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, should be https://static.iterative.ai/img/cml/banner-tpi.svg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
# Iterative Provider [![](https://img.shields.io/badge/-documentation-5c4ee5?logo=terraform)](https://registry.terraform.io/providers/iterative/iterative/latest/docs) | ||
# Terraform Provider Iterative (TPI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's beyond scope but is a preposition missing from the tool's expanded name? I.g. Terraform Provider for Iterative (Tools)?
"Iterative's Terraform Provider" would also be more readable IMO but at this point changing the TLA is prob. out of the question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also on the tool's name: The confusion is reinforced by the official docs site calling it just "iterative provider" (top of nav).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, TPI was named after the repository name, but (cough) doesn't make sense to humans.
|
||
The Iterative Provider makes it easy to: | ||
TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. Full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we be more emphatic on why the machine learning focus is important? I guess it's not a common integration or offering out there in the world of TF providers.
Also should we mention it's designed specifically (though not exclusively I think) for certain Iterative tools? Connecting this with CML early could make sense for example. Rn it feels from this description like it's mainly a stand-alone tool for DYI ML?Ops.
- [Getting Started](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) | ||
- [Authentication](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication) | ||
- [Full reference](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p.s. also these links open in a separate window unexpectedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Links: an idea is to mention/link the Contributing info. in the repo instead in here.
### Requirements | ||
|
||
## Support | ||
- [Install Terraform 1.0+](https://learn.hashicorp.com/tutorials/terraform/install-cli#install-terraform), e.g.: | ||
- Brew (Homebrew/Mac OS): `brew tap hashicorp/tap && brew install hashicorp/tap/terraform` | ||
- Choco (Chocolatey/Windows): `choco install terraform` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I see from the README that the Usage section (Get Started page in docs) comes first. So Authentication should be under GS in the nav, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usage section (Get Started page in docs
Call both Usage or both Get Started, BTW? I'd incline for the latter based on how the content is presented.
|
||
Have a feature request or found a bug? Let us know via [GitHub issues](https://github.com/iterative/terraform-provider-iterative/issues). Have questions? Join our [community on Discord](https://discord.gg/bzA6uY7); we'll be happy to help you get started! | ||
### Define a Task |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💅🏼 Docs should also read "Define" I think. Currently the GS reads "Defining".
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | ||
sudo apt-get update && sudo apt-get install terraform | ||
``` | ||
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][authentication] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm looks like all internal links (except nav) open in a separate window. Idk, feels strange to me.
``` | ||
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][authentication] | ||
|
||
[authentication]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would relative links work? E.g. /providers/iterative/iterative/latest/docs/guides/authentication
or even just (../)authentication
above.
## Help | ||
|
||
Run this command after every `make install` to use the new build: | ||
The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[The GS] has some more information.
Does it? They look vey similar at first glance. Why refer to almost the same doc but in another site? Hard to find meaningful differences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could instead link from relevant places in the README to the other docs, for now Azure Kubernetes Service and Generic Machine Types.
@@ -30,22 +39,24 @@ resource "iterative_task" "task" { | |||
### Required | |||
|
|||
- `cloud` - (Required) Cloud provider to run the task on; valid values are `aws`, `gcp`, `az` and `k8s`. | |||
- `script` - (Required) Script to run (relative to `workdir.input`); must begin with a valid [shebang](<https://en.wikipedia.org/wiki/Shebang_(Unix)>). Can use a string, including a [heredoc](https://www.terraform.io/docs/language/expressions/strings.html#heredoc-strings), or the contents of a file returned by the [`file`](https://www.terraform.io/docs/language/functions/file.html) function. | |||
- `script` - (Required) Script to run (relative to `storage.workdir`); must begin with a valid [shebang](<https://en.wikipedia.org/wiki/Shebang_(Unix)>). Can use a string, including a [heredoc](https://www.terraform.io/docs/language/expressions/strings.html#heredoc-strings), or the contents of a file returned by the [`file`](https://www.terraform.io/docs/language/functions/file.html) function. | |||
|
|||
### Optional | |||
|
|||
- `name` - (Optional) Deterministic task name. | |||
- `region` - (Optional) [Cloud region/zone](#cloud-regions) to run the task on. | |||
- `machine` - (Optional) See [Machine Types](#machine-types) below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is Generic Machine Types under "Development" in the nav? (What does Development even refer to here?) Feels like machine types should be under the same section as the iterative_task
res ref. somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nav structure idea:
TPI
- GS
- Guides
- Auth
- Azure K8s
- Ref
- Task Res
- Machine types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as abbreviations aren't part of the proposal, sounds good. The only thing I find rather dissonant is moving the Machine Types page under the Reference section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abbreviations aren't part of the proposal; I was just being lazy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find rather dissonant is moving the Machine Types page under the Ref
What does "Development" refer to as-is now? Maybe I'm not getting the current intended struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An XY solution would be disguising “Machine types“ as a guide (e.g. how to choose machine types) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah either put it in the guide or in the ref IMO 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that the “Resources” section wat meant isn't a “Reference” in the general sense of the word; it's just a list of resources (hundreds in some providers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can move is under “Resources”, but it's rather unorthodox. 🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's fine, call the section Resources (just also use that word in links if possible, instead of "reference") And put machine types in the guide.
This resource will: | ||
|
||
1. Create cloud resources (machines and storage) for the task. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💅🏼 The title of this page is Task Resource but the nav entry is iterative_task
. Seems inconsistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should probably be Resource: iterative_task
like in other providers.
@jorgeorpinel, even if your review is a bit late (post–merge), you make lots of good points that should be considered and addressed. 😅 |
@casperdcl, should this be discussed here and then applied with a separate pull request? |
task
: makeworkdir.output
more intuitive #414 (or revert be6aa67)fixes #415