
[WORKFLOWS-55 | WORKFLOWS-96] Upgrade Nextflow Tower to v21.06.4 #80

Merged: 5 commits into main on Dec 1, 2021

Conversation

@BrunoGrandePhD (Contributor) commented on Nov 24, 2021

I was originally waiting for a major release before upgrading Tower, but Paolo suggested this patch release might address the issue I reported here. I already deployed this change to Nextflow-dev, and ECS transitioned to the new task definition as expected. I'm now running a test in Tower-dev to see whether the issue has been fixed or not.

Edit: This PR now updates Tower to v21.06.4, which was released to address the error message below, which I reported to Seqera. The upgrade requires compute environments to be re-created so that their configuration can be updated. Hence, I've also updated the Tower configuration script to create versioned compute environments instead of managing a single default CE, which would be more complicated; a rough sketch of this approach follows the error message. The latest version is marked as the primary CE (i.e., the default).

Error when retrieving credentials from container-role: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Connect timeout on endpoint URL: "http://169.254.170.2/v2/credentials/62307d39-2d77-4b25-b9e0-d08c257fe10a"
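
For context on the versioned compute environments mentioned above, here is a minimal sketch of the approach. The client, function, and parameter names below are hypothetical illustrations, not the actual script; the idea is simply to create a new CE with a version-stamped name and then mark it as the workspace's primary (default) CE.

# Illustrative sketch only; tower_api and its methods are hypothetical.
def versioned_ce_name(base_name: str, tower_version: str) -> str:
    """Build a version-stamped CE name, e.g. 'my-project-v21-06-4'."""
    return f"{base_name}-v{tower_version.replace('.', '-')}"

def create_versioned_ce(tower_api, workspace_id: int, base_name: str, tower_version: str) -> str:
    """Create a new versioned CE and mark it as the primary (default) CE."""
    ce_id = tower_api.create_compute_env(
        name=versioned_ce_name(base_name, tower_version),
        workspace_id=workspace_id,
    )
    # The newest versioned CE becomes the default for new runs, while older
    # versioned CEs remain available for any in-flight workflows.
    tower_api.set_primary_compute_env(ce_id, workspace_id=workspace_id)
    return ce_id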

I'm also taking advantage of this update to the Tower configuration script to deploy a potential fix for WORKFLOWS-96, which has to do with running out of disk space despite EBS autoscaling being enabled. This seems to happen with large files (20-60 GB). Given that multiple jobs can run on a given instance, I suspect that the disk fills up before EBS autoscaling can kick in, so increasing the default EBS volume size from 100 GB to 250 GB should mitigate the issue.

I'm still confirming whether the aforementioned issues are fixed. That said, I think we can start the review in the meantime.

@BrunoGrandePhD changed the title from "Upgrade Nextflow Tower to v21.06.2" to "Upgrade Nextflow Tower to v21.06.4" on Nov 29, 2021
@BrunoGrandePhD changed the base branch from main to WORKFLOWS-46/tower-cleanup on November 29, 2021 19:14
@BrunoGrandePhD changed the base branch from WORKFLOWS-46/tower-cleanup to main on November 29, 2021 19:14
@BrunoGrandePhD changed the title from "Upgrade Nextflow Tower to v21.06.4" to "[WORKFLOWS-55 | WORKFLOWS-96] Upgrade Nextflow Tower to v21.06.4" on Nov 29, 2021
@BrunoGrandePhD marked this pull request as ready for review on November 29, 2021 19:18
@BrunoGrandePhD requested a review from a team as a code owner on November 29, 2021 19:18
@thomasyu888 (Collaborator) left a comment

LGTM

@BrunoGrandePhD (Contributor, Author)

I'm asking for a re-review because I made a significant change to the Tower configuration script. It now creates two compute environments per workspace: one for on-demand instances and another for spot instances. Some of the issues that I've been debugging are related to spot termination, so I figured it would be nice to offer the option of on-demand instances (ideal for debugging, when you don't want to also deal with spot termination issues).
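
As a rough illustration of the dual-CE setup (the helper and parameter names below are hypothetical, and the real create_compute_environment signature may differ), the configuration script could loop over the two provisioning models, mirroring how AWS Batch distinguishes spot capacity from regular on-demand (EC2) capacity:

# Illustrative sketch only; tower_api and its methods are hypothetical.
def create_workspace_compute_envs(tower_api, workspace_id: int, base_name: str) -> dict:
    """Create one spot CE and one on-demand CE for a given workspace."""
    ce_ids = {}
    # "SPOT" vs. "EC2" mirrors the AWS Batch provisioning models.
    for label, provisioning_model in [("spot", "SPOT"), ("ondemand", "EC2")]:
        ce_ids[label] = tower_api.create_compute_env(
            name=f"{base_name}-{label}",
            workspace_id=workspace_id,
            provisioning_model=provisioning_model,
        )
    return ce_ids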

@thomasyu888 (Collaborator)

@BrunoGrandePhD Thanks! If there are currently issues with spot instances, should we recommend that users use on-demand instances until the issue is resolved?

We don't actually have full control over when spot instances are shut down, so for an extremely important workflow where we absolutely can't afford to lose an instance, should our SOP be to use on-demand instances?

@@ -514,16 +503,60 @@ def create_compute_environment(self) -> str:
         "ec2KeyPair": None,
         "imageId": None,
         "securityGroups": [],
-        "ebsBlockSize": None,
+        "ebsBlockSize": 250,
@thomasyu888 (Collaborator) commented on this change:

Arbitrary number?

@BrunoGrandePhD (Contributor, Author) replied:

The default is 50 GB, and we're having issues with running out of disk space. I wanted to add some buffer: the largest files that I'm aware of are images from HTAN, which can be as large as 190 GB, so I rounded up to 250 GB. We might need to increase this further if we continue to see disk space issues.

@thomasyu888 (Collaborator) left a comment

Minor comments. LGTM

@BrunoGrandePhD (Contributor, Author)

We expect errors due to spot termination, but I confirmed this morning that everything should work fine as long as you enable retries with the following snippet of Nextflow config. I'll be updating the wiki docs with a description of spot vs. on-demand and recommendations for each. In general, on-demand is good for debugging because it eliminates errors related to spot termination, whereas spot is cost-effective for running an established workflow on a large dataset (as long as retries are enabled to get past the random errors).

process {
  // Retry tasks that fail (e.g., due to spot instance termination)
  errorStrategy = 'retry'
  // Give up after three retries for a given task
  maxRetries = 3
}

@BrunoGrandePhD (Contributor, Author)

Merging this in the evening to avoid any disruption.

@BrunoGrandePhD merged commit 9b46f52 into main on Dec 1, 2021
@BrunoGrandePhD deleted the bgrande/WORKFLOWS-55/upgrade-tower-version branch on December 1, 2021 07:20