Allow tasks to destroy themselves #289

0x2b3bfa0 · 2021-11-24T15:59:44Z

With the current implementation, instances can't destroy all of their supporting resources, because those resources depend on each other. For example, once an instance deletes its security group, it loses network connectivity and can't issue any more API calls.

Possible solutions include:

- Using cloud-native templates like AWS CloudFormation, Google Cloud Deployment Manager and Azure Resource Manager templates to let providers destroy everything.
- Leaving cheap or free resources in the cloud, and running a garbage collector on every invocation to delete resources from past tasks.
- Requiring users to explicitly delete resources after each task. This approach is convenient with the launch/harvest lifecycle, but not for the CML runner.

dacbd · 2021-11-24T16:08:33Z

Do you have a good list of the particular resources that cause this issue?

My two cents is to allow those resources to be predefined by the user (VPC, security group rules, etc.) so tpi doesn't need to clean them up and users can have that extra control.

0x2b3bfa0 · 2022-03-13T23:55:22Z

Alternatively, why not add an explicit cleanup step at the end of workflows?

on: workflow_dispatch
jobs:
  create:
    runs-on: ubuntu-latest
    steps:
      - uses: iterative/setup-cml@v1
      - run: cml runner create ${{ github.run_id }}
  reproduce:
    needs: create
    runs-on: self-hosted
    steps:
      - uses: iterative/setup-dvc@v1
      - run: dvc repro
  delete:
    if: always()
    needs: reproduce
    runs-on: ubuntu-latest
    steps:
      - uses: iterative/setup-cml@v1
      - run: cml runner delete ${{ github.run_id }}

On GitHub Actions, this can even be automated by using the post functionality.

dacbd · 2022-06-06T15:31:00Z

> On GitHub Actions, this can even be automated by using the post functionality.

Not quite since the post is on the setup level and runs on the same host that was defined for the job. So it would either delete the instance right after it was made or would run on the instance and have the same problem.
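
For context, the post functionality mentioned here is the post hook that a JavaScript action can declare in its metadata file. A minimal sketch follows, assuming a hypothetical action whose name and script paths are purely illustrative; it is not the actual metadata of any setup action in this thread.

```yaml
# action.yml of a hypothetical JavaScript action (name and paths are illustrative)
name: setup-example
description: Illustrates the main and post hooks of a JavaScript action
runs:
  using: node20
  main: dist/main.js      # runs when the job reaches the step that uses the action
  post: dist/cleanup.js   # runs at the end of the same job, on the same runner
  post-if: always()       # run the cleanup even if earlier steps failed
```

Because the post script is tied to the job that used the action, it executes wherever that job runs, which is the limitation described above.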

the create, run, delete/clean-up as separate jobs could work but that starts to feel pretty opinionated (forced convention) on your ci/cd scripts. I'm not super opposed but I feel more insight/opinions on it should be gathered.

0x2b3bfa0 · 2022-06-07T10:55:59Z

> Not quite since the post is on the setup level and runs on the same host that was defined for the job. So it would either delete the instance right after it was made or would run on the instance and have the same problem.

Oh, my! 🤦🏼‍♂️ Yes, you're absolutely right.

0x2b3bfa0 · 2022-06-07T10:58:22Z

> the create, run, delete/clean-up as separate jobs could work but that starts to feel pretty opinionated (forced convention) on your ci/cd scripts.

As opinionated as requiring separate deploy and train jobs as we already do, but slightly more bulky? 😅

> I'm not super opposed but I feel more insight/opinions on it should be gathered.

Definitely, and we should also explore webhook-based scaling solutions like the ones proposed at https://docs.github.com/es/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners

casperdcl · 2022-06-09T10:44:37Z

- Is if: always() or an equivalent supported in all CIs?
- The extra bulkiness sounds fine, provided it's optional (to fix edge cases) rather than required by default.
- In any case, a standalone cml runner gc [--all] command (or equivalent) sounds nice to have as a prerequisite.
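
As one data point on the first question, GitLab CI/CD offers a comparable mechanism through when: always. The sketch below mirrors the three-job workflow proposed earlier in this thread. It rests on several assumptions: the cml runner create and cml runner delete commands are the ones proposed above, the cml runner tag is illustrative, and dvc is presumed to be already installed on the created runner. Other CI systems would need to be checked individually.

```yaml
# .gitlab-ci.yml (sketch; job names and the runner tag are illustrative)
stages: [create, reproduce, delete]

create:
  stage: create
  image: node:20
  script:
    - npm install --global @dvcorg/cml
    - cml runner create "$CI_PIPELINE_ID"   # command as proposed in this thread

reproduce:
  stage: reproduce
  tags: [cml]            # assumed tag of the runner created above
  script:
    - dvc repro          # assumes dvc is available on that runner

delete:
  stage: delete
  image: node:20
  when: always           # run regardless of the status of earlier stages
  script:
    - npm install --global @dvcorg/cml
    - cml runner delete "$CI_PIPELINE_ID"   # command as proposed in this thread
```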

0x2b3bfa0 added the enhancement (New feature or request) and resource-task (iterative_task TF resource) labels Nov 24, 2021

This was referenced Nov 25, 2021

task After task completion computing resources should be released (destroyed) #302

Open

task recovering two tasks specifying a different folder ends in error #300

Closed

DavidGOrtega added the p1-important (High priority) label Nov 29, 2021

0x2b3bfa0 mentioned this issue Nov 29, 2021

task bucket usage vs "directory" within a bucket #299

Closed

casperdcl mentioned this issue Jan 12, 2022

improve data sync features #362

Open

0x2b3bfa0 mentioned this issue Apr 24, 2022

Improve the AWS VPC architecture #107

Closed

0x2b3bfa0 mentioned this issue Jun 6, 2022

runner gcp firewall quota #604

Open

0x2b3bfa0 mentioned this issue Oct 19, 2022

Support reusing existing storage containers across task providers #687

Merged

casperdcl mentioned this issue Oct 28, 2022

Losing network for a while can endup with the runner running forever (GH at least) iterative/cml#1014

Closed
