
[WORKFLOWS-55 | WORKFLOWS-96] Upgrade Nextflow Tower to v21.06.4 #80

Merged: 5 commits into main on Dec 1, 2021

Conversation

@BrunoGrandePhD (Contributor) commented on Nov 24, 2021

I was originally waiting for a major release before upgrading Tower, but Paolo suggested this patch release might address the issue I reported here. I already deployed this change to Nextflow-dev, and ECS transitioned to the new task definition as expected. I'm now running a test in Tower-dev to see whether the issue has been fixed or not.

Edit: This PR now updates Tower to v21.06.4, which was released to address the error message below, which I reported to Seqera. The upgrade requires compute environments to be re-created so that their configuration can be updated. Hence, I've also updated the Tower configuration script to create versioned compute environments instead of managing a single default CE, which would be more complicated; a rough sketch of this approach follows the error message. The latest version is marked as the primary CE (i.e., the default).

Error when retrieving credentials from container-role: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Connect timeout on endpoint URL: "http://169.254.170.2/v2/credentials/62307d39-2d77-4b25-b9e0-d08c257fe10a"
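
For context on the versioned compute environments mentioned above, here is a minimal sketch of the approach. The client, function, and parameter names below are hypothetical illustrations, not the actual script; the idea is simply to create a new CE with a version-stamped name and then mark it as the workspace's primary (default) CE.

# Illustrative sketch only; tower_api and its methods are hypothetical.
def versioned_ce_name(base_name: str, tower_version: str) -> str:
    """Build a version-stamped CE name, e.g. 'my-project-v21-06-4'."""
    return f"{base_name}-v{tower_version.replace('.', '-')}"

def create_versioned_ce(tower_api, workspace_id: int, base_name: str, tower_version: str) -> str:
    """Create a new versioned CE and mark it as the primary (default) CE."""
    ce_id = tower_api.create_compute_env(
        name=versioned_ce_name(base_name, tower_version),
        workspace_id=workspace_id,
    )
    # The newest versioned CE becomes the default for new runs, while older
    # versioned CEs remain available for any in-flight workflows.
    tower_api.set_primary_compute_env(ce_id, workspace_id=workspace_id)
    return ce_id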

I'm also taking advantage of this update to the Tower configuration script to deploy a potential fix for WORKFLOWS-96, which has to do with running out of disk space despite EBS autoscaling being enabled. This seems to happen with large files (20-60 GB). Given that multiple jobs can run on a given instance, I suspect that the disk fills up before EBS autoscaling can kick in, so increasing the default EBS volume size from 100 GB to 250 GB should mitigate the issue.

I'm still confirming whether the aforementioned issues are fixed. That said, I think we can start the review in the meantime.

@BrunoGrandePhD changed the title from "Upgrade Nextflow Tower to v21.06.2" to "Upgrade Nextflow Tower to v21.06.4" on Nov 29, 2021
@BrunoGrandePhD changed the base branch from main to WORKFLOWS-46/tower-cleanup on November 29, 2021 19:14
@BrunoGrandePhD changed the base branch from WORKFLOWS-46/tower-cleanup to main on November 29, 2021 19:14
@BrunoGrandePhD changed the title from "Upgrade Nextflow Tower to v21.06.4" to "[WORKFLOWS-55 | WORKFLOWS-96] Upgrade Nextflow Tower to v21.06.4" on Nov 29, 2021
@BrunoGrandePhD marked this pull request as ready for review on November 29, 2021 19:18
@BrunoGrandePhD requested a review from a team as a code owner on November 29, 2021 19:18
@thomasyu888 (Collaborator) left a comment

LGTM

@BrunoGrandePhD (Contributor, Author)

I'm asking for a re-review because I made a significant change to the Tower configuration script. It now creates two compute environments per workspace: one for on-demand instances and another for spot instances. Some of the issues that I've been debugging are related to spot termination, so I figured it would be nice to offer the option of on-demand instances (ideal for debugging, when you don't want to also deal with spot termination issues).
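
As a rough illustration of the dual-CE setup (the helper and parameter names below are hypothetical, and the real create_compute_environment signature may differ), the configuration script could loop over the two provisioning models, mirroring how AWS Batch distinguishes spot capacity from regular on-demand (EC2) capacity:

# Illustrative sketch only; tower_api and its methods are hypothetical.
def create_workspace_compute_envs(tower_api, workspace_id: int, base_name: str) -> dict:
    """Create one spot CE and one on-demand CE for a given workspace."""
    ce_ids = {}
    # "SPOT" vs. "EC2" mirrors the AWS Batch provisioning models.
    for label, provisioning_model in [("spot", "SPOT"), ("ondemand", "EC2")]:
        ce_ids[label] = tower_api.create_compute_env(
            name=f"{base_name}-{label}",
            workspace_id=workspace_id,
            provisioning_model=provisioning_model,
        )
    return ce_ids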

@thomasyu888 (Collaborator)

@BrunoGrandePhD Thanks! If there are currently issues with spot instances, should we recommend that users use on-demand instances until the issue is resolved?

We don't actually have full control over when spot instances are shut down, so for an extremely important workflow where we absolutely can't afford to lose an instance, should our SOP be to use on-demand instances?

@@ -514,16 +503,60 @@ def create_compute_environment(self) -> str:
         "ec2KeyPair": None,
         "imageId": None,
         "securityGroups": [],
-        "ebsBlockSize": None,
+        "ebsBlockSize": 250,
@thomasyu888 (Collaborator) commented on this change:

Arbitrary number?

@BrunoGrandePhD (Contributor, Author) replied:

The default is 50 GB, and we're having issues with running out of disk space. I wanted to add some buffer: the largest files that I'm aware of are images from HTAN, which can be as large as 190 GB, so I rounded up to 250 GB. We might need to increase this further if we continue to see disk space issues.

@thomasyu888 (Collaborator) left a comment

Minor comments. LGTM

@BrunoGrandePhD (Contributor, Author)

We expect errors due to spot termination, but I confirmed this morning that everything should work fine as long as you enable retries with the following snippet of Nextflow config. I'll be updating the wiki docs with a description of spot vs. on-demand and recommendations for each. In general, on-demand is good for debugging because it eliminates errors related to spot termination, whereas spot is cost-effective for running an established workflow on a large dataset (as long as retries are enabled to get past the random errors).

process {
  // Retry tasks that fail (e.g., due to spot instance termination)
  errorStrategy = 'retry'
  // Give up after three retries for a given task
  maxRetries = 3
}

@BrunoGrandePhD (Contributor, Author)

Merging this in the evening to avoid any disruption.

@BrunoGrandePhD merged commit 9b46f52 into main on Dec 1, 2021
@BrunoGrandePhD deleted the bgrande/WORKFLOWS-55/upgrade-tower-version branch on December 1, 2021 07:20