Wave containers not being pulled #85
Comments
I updated Nextflow, hoping that there was a bug in the previous version, though I can't rule out that the time delay might mean that we have gotten past any throttle limits before I began the run. I did check the previous logs and confirmed that the token was being sent as expected.
The previous run is on track to complete, with only merge tasks remaining, despite a series of apparent failures for container pulls (zero execution time) during doublet detection. I am still not sure what is going on exactly, but I will keep this open for monitoring. If we are running into pull limits, one solution may be to batch samples together to reduce the number of jobs. Alternatively, if we can get a way to use the Fusion file system without relying so heavily on the Wave system, that might be ideal.
I take it back. The job seems to have stalled, and we are still seeing container pull errors. I'm still not sure what is going on.
Looking at the actual error on AWS Batch, I see:
So now I wonder if it might be a problem with pulling the base layer that the wave container is built on. Are we maybe hitting a limit for the AWS public ECR? If so, I wonder if we can pass any additional permissions to the nextflow user. I will test adding ECR read permissions and see if that gets us anywhere.
I can confirm that all the doublet_detection jobs in https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/VcxEXOaURslUi have now actually finished, and without any docker pull errors. So things seem to be good, for now.
This error seems to be related to throttling of the ECR pull requests from the public channel. I'm somewhat surprised they are throttling pulls to batch instances, but that is my best guess. I did add a few more permissions to the role that is used by the instances. If this problem recurs or persists, I think the best option may be to try to reduce the total number of jobs that we send. We can reevaluate if/when we next see throttling errors, I expect.
Sadly, we are again hitting the same errors with the latest (staging) run. I think the next move is probably to adjust the workflows to try to batch jobs into larger work units. I expect to be able to limp along and rerun the workflow to completion with the current settings by running the simulated and real data separately, but this should be addressed for the future.
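One stopgap that might help spread the load without restructuring the workflow, sketched below, is to throttle how fast Nextflow submits tasks so that container pulls don't all arrive in a single burst. This is an untested suggestion rather than something from the runs above, and the rate value is an arbitrary placeholder:

```groovy
// nextflow.config -- a possible stopgap (untested here): cap the task submission
// rate so container pulls are spread out rather than arriving all at once.
// The '50/1min' value is an arbitrary example, not a tuned recommendation.
executor {
    submitRateLimit = '50/1min'   // at most 50 task submissions per minute
}
```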
I have been testing this in https://github.com/jashapiro/nf-wave-test and was able to reproduce the behavior yesterday. I continued onward to do a bit more testing, which included some consolidations in preparation for submission of a bug report, and testing to see if we would still have the same error using our previous AWS batch stack. When I started to test that today, I was excited to see that the error was no longer coming up, suggesting that the problem was likely with our new batch stack. I went back to our new batch stack, though, and now I am no longer getting errors, despite submitting thousands of jobs. I reverted to the tagged version with which I was previously getting errors, and I have not been able to recreate them.

At some point I had also updated Nextflow to 24.04.4 (from 24.04.3), and while I thought that this could be where the fix happened, when I reverted to the previous version I was still unable to recreate the error with either version of the profiles. So I am now back to thinking this was a bug at the Seqera end, and perhaps they saw it somehow in logs or other usage. It is also possible that the fix was in a plugin which was not redownloaded when I downgraded, as I don't know exactly how plugins are handled. To account for that eventuality, I upgraded Nextflow on the workload server, and will monitor to see if the issue remains resolved. In the meantime, I found a few places where we seem to have settings that are no longer required, which I will be submitting separately.
Okay, I got it failing again, at the following commit: 41345f8, though it did initially fail with 4f6ec98 when I was wondering if module-specific binaries might be required. It just seems that it may take a lot of jobs (or many runs) to induce failure. I have saved a number of log files now, so it should be possible to start to send out some inquiries with tests.
Okay, after discussion with the Seqera team on Slack, it does seem that the issue is likely to be ECR API rate limiting. Luckily, this should be fairly straightforward to solve? According to the AWS docs that I can see, we will need an account with credentials that can authenticate to the public ECR. Tagging @davidsmejia and @jaclyn-taroni for advice/thoughts on next steps.
We have added credentials to the Seqera account used by the batch workflow to allow login to the public ECR, and that seems to have solved the wave container issue! 🎉
I may have declared victory too soon. The errors are recurring.
I wonder if the better solution may be to add …
I didn't necessarily want to declare victory here yet... I'm looking at https://docs.aws.amazon.com/AmazonECR/latest/public/public-service-quotas.html and wondering if we are even getting ahead of 10 pulls per second at times, which is the max for "Rate of image pulls to AWS resources" and "Rate of authenticated image pulls". The latter can be raised by increasing a service quota, but the former cannot, and it is unclear to me which applies "first." What doesn't fully make sense to me, though, is why retries don't end up falling outside the window; they should be delayed by the retry backoff.
But maybe we need to increase the delay there? We'll see if https://github.com/AlexsLemonade/OpenScPCA-infra/issues/52 does make a difference though.
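If the delay in question is Nextflow's client-side Wave retry policy, it can presumably be tuned in the configuration. A minimal sketch, assuming the `wave.retryPolicy.*` options documented for recent Nextflow releases; the values are illustrative guesses rather than anything tested in this workflow:

```groovy
// nextflow.config -- a minimal sketch, assuming the wave.retryPolicy.* options
// available in recent Nextflow releases; values are illustrative, not tested here.
wave.enabled                 = true
wave.retryPolicy.delay       = '1s'    // initial delay before the first retry
wave.retryPolicy.maxDelay    = '150s'  // upper bound on the backoff delay
wave.retryPolicy.maxAttempts = 10      // how many times to retry before failing
wave.retryPolicy.jitter      = 0.5     // randomize delays so retries don't re-align with the rate window
```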
I am reopening this, as we continue to get errors on large runs. I think I will have to go back to the Seqera folks with more info. Interestingly, when I look at the usage of the AWS creds we have stored with them, they are being accessed, but AWS doesn't seem to be recording them being used for public ECR access...
I think the issue may have been how we were storing the credentials on Seqera: I had stored them as AWS credentials, but they may have needed to be stored explicitly as a "Container Registry" credential, with the registry server pointing at the public ECR. Testing now with the credentials moved to that slot.
Update: I can see that nextflow/wave did access the credentials today! So I think that means they are actually in the right place now and maybe we can really close this! One more set of tests in progress though.
Looks good, closing, hopefully not to reopen.
When trying to deploy the v0.1.0 and v0.1.1 releases, we ran into errors where containers were not being pulled as expected, leading to repeated failures. Test runs did just fine, but the later runs started to fail, which seems to suggest that there might still be an issue with rate limits related to wave containers.
While we are populating the `TOWER_ACCESS_TOKEN` environment variable on launch, and we know this is getting accepted because we can monitor the runs on Seqera cloud, I wonder if we might actually need to populate the `tower.accessToken` configuration variable as well/instead for wave containers specifically. While this would surprise me, I can't think of another reason at the moment. There is also a possibility that there is an error somehow related to wave pulling containers from ECR, though again this seems unlikely to be a root cause.
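For reference, a minimal sketch of what populating the configuration variable might look like; reading the value back from `TOWER_ACCESS_TOKEN` is just one way to avoid hard-coding the secret, and whether Wave actually needs this (rather than the environment variable alone) is exactly the open question here:

```groovy
// nextflow.config -- a minimal sketch, not the workflow's actual configuration:
// set tower.accessToken explicitly, reusing the TOWER_ACCESS_TOKEN environment
// variable that is already exported at launch.
tower {
    enabled     = true
    accessToken = System.getenv('TOWER_ACCESS_TOKEN')
}
```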
My plan is to wait and try to deploy the 0.1.1 release one more time, but to first have a more careful look at the log files to try to determine if there is something more specific that I can find.
If this cannot be resolved, we may need to abandon the use of wave containers, but I am hoping it does not come to that, and I will try to ask Seqera for assistance before we get to that stage.