
Wave containers not being pulled #85

Closed
jashapiro opened this issue Jul 29, 2024 · 18 comments

@jashapiro
Member

When trying to deploy the v0.1.0 and v0.1.1 releases, we ran into errors where containers were not being pulled as expected, leading to repeated failures. Test runs complete just fine, but the later runs started to fail, which suggests there may still be an issue with rate limits related to Wave containers.

We are populating the TOWER_ACCESS_TOKEN environment variable on launch, and we know it is being accepted because we can monitor the runs on Seqera Cloud, but I wonder if we might also (or instead) need to set the tower.accessToken configuration option specifically for Wave containers. This would surprise me, but I can't think of another explanation at the moment.
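For reference, a minimal sketch of what setting this in the configuration might look like, reusing the environment variable we already populate; exactly where this would live in our config files is an assumption on my part:

```groovy
// Hypothetical nextflow.config snippet (not our actual configuration):
// set tower.accessToken explicitly, reusing the TOWER_ACCESS_TOKEN
// environment variable that is already populated at launch.
tower {
    enabled     = true
    accessToken = System.getenv('TOWER_ACCESS_TOKEN')
}
```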

There is also a possibility that the error is somehow related to Wave pulling containers from ECR, though again this seems unlikely to be the root cause.

My plan is to wait and try to deploy the 0.1.1 release one more time, but first to take a more careful look at the log files to see if I can find anything more specific.

If this cannot be resolved, we may need to abandon the use of Wave containers, but I am hoping it does not come to that; I will try to ask Seqera for assistance before we get to that stage.

@jashapiro
Member Author

I updated Nextflow on the AWS workload machine to the latest version, 24.04.3, and triggered a manual run for workflow version v0.1.1 (only the real data, as the simulations had gone fine). That run is currently in progress at https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/4Rrf1pyhMkOBw6 and appears to be working as expected, pulling images without errors.

I'm hoping the problem was a bug in the previous Nextflow version, though I can't rule out that the time delay simply meant we were past any throttle limits before I began the run.

I did check the previous logs and confirmed that the token was being sent as expected.

@jashapiro
Member Author

The previous run is on track to complete, with only merge tasks remaining, despite a series of apparent container pull failures (tasks with zero execution time) during doublet detection. I am still not sure exactly what is going on, but I will keep this open for monitoring.

If we are running into pull limits, one solution may be to batch samples together to reduce the number of jobs.

Alternatively, if we can find a way to use the Fusion file system without relying so heavily on the Wave system, that might be ideal.
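For context, the coupling I have in mind looks roughly like the sketch below; this is an assumption about the relevant settings rather than a copy of our actual profiles, but it illustrates why Fusion currently pulls Wave along with it:

```groovy
// Assumed configuration sketch (illustrative, not our actual profile):
// Fusion's virtual file system is delivered via Wave-provisioned containers,
// so enabling Fusion effectively requires enabling Wave as well.
wave {
    enabled = true
}
fusion {
    enabled = true
}
```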

@jashapiro
Member Author

I take it back. The job seems to have stalled, and we are still seeing container pull errors. I'm still not sure what is going on.

@jashapiro
Member Author

Looking at the actual error on AWS Batch, I see:

CannotPullContainerError: Error response from daemon: unknown: repository 'public.ecr.aws/openscpca/doublet-detection:v0.1.0' bad request (400)

So now I wonder if it might be a problem with pulling the base layer that the wave container is built on.

Are we maybe hitting a limit for the AWS public ECR? If so, I wonder if we can grant additional permissions to the Nextflow user. I will test adding ECR read permissions and see if that gets us anywhere.

@jashapiro
Member Author

I can confirm that all the doublet_detection jobs in https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/VcxEXOaURslUi have now actually finished, without any Docker pull errors.

So things seem to be good, for now.

@jashapiro
Member Author

This error seems to be related to throttling of ECR pull requests from the public registry. I'm somewhat surprised they are throttling pulls from Batch instances, but that is my best guess. I did add a few more permissions to the role used by the instances (SSM-core), but I am not sure whether that made the difference.

If this problem recurs/persists, I think the best option may be to try to reduce the total number of jobs that we send.
This would mean updating/modifying processes to allow them to take sets of samples, rather than only one at a time.
We could revert to expecting modules to take a project at a time, though this might create more unbalanced loads, or add functionality to create lists of values as inputs, using collate() or buffer() to build the sets, probably followed by a `transpose()` operation (see the sketch below). I haven't fully thought this through though!
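As a rough illustration of the channel logic only, a hedged sketch with placeholder names and values (no real processes or modules from the workflow):

```groovy
// Hypothetical sketch: group per-sample values into batches with collate(),
// stand in for a batched process with a map() call, then split the grouped
// outputs back into per-sample pairs with transpose(). All names are placeholders.
workflow {
    Channel
        .of('sampleA', 'sampleB', 'sampleC', 'sampleD', 'sampleE')
        .collate(2)  // -> [sampleA, sampleB], [sampleC, sampleD], [sampleE]
        .map { samples -> [samples, samples.collect { "${it}_result" }] } // batched "process" stand-in
        .transpose() // -> [sampleA, sampleA_result], [sampleB, sampleB_result], ...
        .view()
}
```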

We can reevaluate if/when we next see throttling errors, I expect.

@jashapiro
Member Author

Looking at the actual error on AWS Batch, I see:

CannotPullContainerError: Error response from daemon: unknown: repository 'public.ecr.aws/openscpca/doublet-detection:v0.1.0' bad request (400)

Sadly, we are again hitting the same errors with the latest (staging) run. I think the next move is probably to adjust the workflows to try to batch jobs into larger work units. I expect to be able to limp along and rerun the workflow to completion with the current settings by running the simulated and real data separately, but this should be addressed for the future.

@jashapiro
Member Author

I have been testing this in https://github.com/jashapiro/nf-wave-test and was able to reproduce the behavior yesterday with the same CannotPullContainerError. At the time I was (I think) at the throttles-somteimes tag. I submitted 1000 jobs cleanly in an initial run, but when I submitted more I immediately started to get the pull error.

I continued with a bit more testing, which included some consolidation in preparation for submitting a bug report, and checking whether we would still see the same error using our previous AWS Batch stack. When I started those tests today, I was excited to see that the error was no longer coming up, suggesting the problem was likely our new Batch stack. But when I went back to our new Batch stack, I am no longer getting errors there either, despite submitting thousands of jobs.

I reverted to the tagged version where I was previously getting errors, and I have not been able to recreate them.

At some point I had also updated Nextflow to 24.04.4 (from 24.04.3), and while I thought that could be where the fix happened, when I reverted to the previous version I was still unable to recreate the error with either version of the profiles.

So I am now back to thinking this was a bug at the Seqera end, and perhaps they spotted it in their logs or other usage. It is also possible that the fix was in a plugin that was not re-downloaded when I downgraded, as I don't know exactly how plugins are handled. To account for that eventuality, I upgraded Nextflow on the workload server and will monitor to see whether the issue remains resolved.

In the meantime, I found a few places where we seem to have settings that are no longer required, which I will be submitting separately.

@jashapiro
Member Author

Okay, I got it failing again, at the following commit: 41345f8, though it did initially fail with 4f6ec98 when I was wondering if module-specific binaries might be required.

It just seems that it may take a lot of jobs (or many runs) to induce a failure. I have saved a number of log files now, so it should be possible to start sending out some inquiries along with test results.

@jashapiro
Member Author

Okay, after discussion with the Seqera team on Slack, it does seem that the issue is likely ECR API rate limiting. Luckily, this should be fairly straightforward to solve? According to the AWS docs I can find, we will need an account with the ecr-public:GetAuthorizationToken and sts:GetServiceBearerToken permissions so that it can log in to the public ECR for authenticated pulls, which vastly increases the rate limit. (I had assumed that other ecr-public Get/Describe privileges might also be required, but maybe not, since the registry is public? We can test this.) As far as I can tell, this does not need to be an account with any particular access to other resources, which may be useful, as I believe we do need to be able to provide an access key and secret key; SSO is not supported (though we can use a role, if desired).

Tagging @davidsmejia and @jaclyn-taroni for advice/thoughts on next steps.

@jashapiro
Member Author

We have added credentials to the Seqera account used by the batch workflow to allow login to the public ECR, and that seems to have solved the wave container issue! 🎉

@jashapiro
Member Author

I may have declared victory too soon. The errors are recurring.

@jashapiro jashapiro reopened this Nov 14, 2024
@jashapiro
Member Author

jashapiro commented Nov 14, 2024

I wonder if the better solution may be to add arn:aws:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly to the openscpca-nf roles? (I thought that I had added these permissions to the appropriate roles, but maybe not)

@jashapiro
Member Author

I didn't necessarily want to declare victory here yet...

I'm looking at https://docs.aws.amazon.com/AmazonECR/latest/public/public-service-quotas.html and wondering whether we are even exceeding 10 pulls per second at times, which is the maximum for both "Rate of image pulls to AWS resources" and "Rate of authenticated image pulls". The latter can be raised by requesting a service quota increase, but the former cannot, and it is unclear to me which applies "first." What doesn't fully make sense to me, though, is why retries don't end up falling outside the throttling window; they should be delayed by

sleep(Math.pow(2, task.attempt) * 200 as long) // sleep to allow for transient errors

But maybe we need to increase the delay there?
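For example, assuming that sleep sits inside a dynamic errorStrategy closure (the usual pattern for retry with backoff), a longer delay might look something like this; the 2000 ms base is an illustrative guess rather than a tested value:

```groovy
// Hypothetical sketch of a longer exponential backoff before each retry:
// waits roughly 4 s, 8 s, and 16 s for attempts 1, 2, and 3 before retrying.
process {
    errorStrategy = {
        sleep(Math.pow(2, task.attempt) * 2000 as long) // pause to get past transient throttling
        return 'retry'
    }
    maxRetries = 3
}
```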

We'll see if https://github.com/AlexsLemonade/OpenScPCA-infra/issues/52 does make a difference though.

@jashapiro
Member Author

I am reopening this, as we continue to get errors on large runs.

I think I will have to go back to the Seqera folks with more info. Interestingly, when I look at the usage of the AWS credentials we have stored with them, they are being accessed, but AWS does not seem to record them being used for public ECR access...

@jashapiro jashapiro reopened this Dec 23, 2024
@jashapiro
Member Author

I think the issue may have been how we were storing the credentials on Seqera: I had stored them as AWS credentials, but they may have needed to be stored explicitly as a "Container Registry" with the registry server as public.ecr.aws.

Testing now with the credentials moved to that slot.

@jashapiro
Member Author

Update: I can see that nextflow/wave did access the credentials today! So I think that means they are actually in the right place now and maybe we can really close this! One more set of tests in progress though.

@jashapiro
Member Author

Looks good, closing, hopefully not to reopen.
