[WORKFLOWS-55 | WORKFLOWS-96] Upgrade Nextflow Tower to v21.06.4
#80
Conversation
Version bump: v21.06.2 → v21.06.4 (branch …upgrade-tower-version)
LGTM
I'm asking for a re-review because I made a significant change to the Tower configuration script. It now creates two compute environments per workspace: one for on-demand instances and another for spot instances. Some of the issues I've been debugging are related to spot termination, so I figured it would be nice to provide the option of using on-demand instances (ideal for debugging, when you don't want to also deal with spot termination issues).
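To illustrate the idea, here's a minimal sketch with assumed names and fields (not the actual script or Tower API payload):

```python
# Illustrative sketch only: the function and field names here are
# assumptions, not the real Tower configuration script.
def compute_env_pair(workspace: str) -> list:
    """Build one on-demand and one spot compute environment per workspace."""
    return [
        {
            "name": f"{workspace}-{label}",
            # AWS Batch terms: "EC2" = on-demand instances, "SPOT" = spot
            "provisioningModel": model,
        }
        for label, model in (("ondemand", "EC2"), ("spot", "SPOT"))
    ]

# e.g. compute_env_pair("myworkspace") yields "myworkspace-ondemand"
# and "myworkspace-spot" configs
```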
@BrunoGrandePhD Thanks. If there are currently issues with spot instances, should we recommend that users use on-demand instances until the issue is resolved? We don't have full control over when spot instances are shut down, so for an extremely important workflow where we absolutely don't want an instance to go down, should our SOP be to use on-demand instances?
```diff
@@ -514,16 +503,60 @@ def create_compute_environment(self) -> str:
     "ec2KeyPair": None,
     "imageId": None,
     "securityGroups": [],
-    "ebsBlockSize": None,
+    "ebsBlockSize": 250,
```
Arbitrary number?
The default is 50 GB, and we're having issues with running out of disk space. I wanted to add some buffer, and the largest files that I'm aware of are images from HTAN, which can be as large as 190 GB. I rounded off the number to 250 GB. We might need to increase this if we continue to see disk space issues.
Minor comments. LGTM
We expect errors due to spot termination, but I confirmed this morning that everything should work fine as long as you enable retries with the following snippet of Nextflow config. I'll be updating the wiki docs with a description of spot vs. on-demand and recommendations for each. In general, on-demand is good for debugging because it eliminates errors related to spot termination, whereas spot is cost-effective for running an established workflow on a large dataset (as long as retries are enabled to get past the random errors).
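(The snippet itself wasn't captured in this thread; a minimal sketch of that kind of retry config, where the `maxRetries` value of 3 is an assumption:)

```groovy
// Sketch of a retry config; 3 retries is an assumed, not prescribed, value.
// Tasks killed by spot reclamation fail and get resubmitted automatically.
process {
    errorStrategy = 'retry'
    maxRetries = 3
}
```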
Merging this in the evening to avoid any disruption.
I was originally waiting for a major release before upgrading Tower, but Paolo suggested this patch release might address the issue I reported here. I already deployed this change to Nextflow-dev, and ECS transitioned to the new task definition as expected. I'm now running a test in Tower-dev to see whether the issue has been fixed or not.
Edit: This PR now updates Tower to v21.06.4, which was released to address an error message that I reported to Seqera. This requires compute environments to be re-created so that their configuration can be updated. Hence, I've also updated the Tower configuration script to create versioned compute environments (instead of managing a single default CE, which would be more complicated). The latest version is marked as the primary CE (i.e., the default).

I'm also taking advantage of this update to the Tower configuration script to deploy a potential fix for WORKFLOWS-96, which has to do with running out of disk space despite EBS autoscaling being enabled. This seems to happen with large files (20-60 GB). Given that multiple jobs can run on a given instance, I suspect the disk is running out of space before EBS autoscaling can kick in; increasing the default EBS volume size from 100 GB to 250 GB should mitigate this.
I'm still confirming whether the aforementioned issues are fixed. That said, I think we can start the review in the meantime.