
Wave containers not being pulled #85

Closed
jashapiro opened this issue Jul 29, 2024 · 18 comments

@jashapiro
Member

When trying to deploy the v0.1.0 and v0.1.1 releases, we ran into errors where containers were not being pulled as expected, leading to repeated failures. Test runs complete just fine, but the later runs started to fail, which suggests there may still be an issue with rate limits related to Wave containers.

We are populating the TOWER_ACCESS_TOKEN environment variable on launch, and we know it is being accepted because we can monitor the runs on Seqera Cloud, but I wonder if we might also (or instead) need to set the tower.accessToken configuration option specifically for Wave containers. This would surprise me, but I can't think of another explanation at the moment.
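For reference, a minimal sketch of what setting this in the configuration might look like, reusing the environment variable we already populate; exactly where this would live in our config files is an assumption on my part:

```groovy
// Hypothetical nextflow.config snippet (not our actual configuration):
// set tower.accessToken explicitly, reusing the TOWER_ACCESS_TOKEN
// environment variable that is already populated at launch.
tower {
    enabled     = true
    accessToken = System.getenv('TOWER_ACCESS_TOKEN')
}
```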

There is also a possibility that the error is somehow related to Wave pulling containers from ECR, though again this seems unlikely to be the root cause.

My plan is to wait and try to deploy the 0.1.1 release one more time, but first to take a more careful look at the log files to see if I can find anything more specific.

If this cannot be resolved, we may need to abandon the use of Wave containers, but I am hoping it does not come to that; I will try to ask Seqera for assistance before we get to that stage.

@jashapiro
Member Author

I updated Nextflow on the AWS workload machine to the latest version, 24.04.3, and triggered a manual run for workflow version v0.1.1 (only the real data, as the simulations had gone fine). That run is currently in progress at https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/4Rrf1pyhMkOBw6 and appears to be working as expected, pulling images without errors.

I'm hoping the problem was a bug in the previous Nextflow version, though I can't rule out that the time delay simply meant we were past any throttle limits before I began the run.

I did check the previous logs and confirmed that the token was being sent as expected.

@jashapiro
Member Author

The previous run is on track to complete, with only merge tasks remaining, despite a series of apparent container pull failures (tasks with zero execution time) during doublet detection. I am still not sure exactly what is going on, but I will keep this open for monitoring.

If we are running into pull limits, one solution may be to batch samples together to reduce the number of jobs.

Alternatively, if we can find a way to use the Fusion file system without relying so heavily on the Wave system, that might be ideal.
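For context, the coupling I have in mind looks roughly like the sketch below; this is an assumption about the relevant settings rather than a copy of our actual profiles, but it illustrates why Fusion currently pulls Wave along with it:

```groovy
// Assumed configuration sketch (illustrative, not our actual profile):
// Fusion's virtual file system is delivered via Wave-provisioned containers,
// so enabling Fusion effectively requires enabling Wave as well.
wave {
    enabled = true
}
fusion {
    enabled = true
}
```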

@jashapiro
Member Author

I take it back. The job seems to have stalled, and we are still seeing container pull errors. I'm still not sure what is going on.

@jashapiro
Member Author

Looking at the actual error on AWS Batch, I see:

CannotPullContainerError: Error response from daemon: unknown: repository 'public.ecr.aws/openscpca/doublet-detection:v0.1.0' bad request (400)

So now I wonder if it might be a problem with pulling the base layer that the wave container is built on.

Are we maybe hitting a limit for the AWS public ECR? If so, I wonder if we can grant additional permissions to the Nextflow user. I will test adding ECR read permissions and see if that gets us anywhere.

@jashapiro
Member Author

I can confirm that all the doublet_detection jobs in https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/VcxEXOaURslUi have now actually finished, without any Docker pull errors.

So things seem to be good, for now.

@jashapiro
Member Author

This error seems to be related to throttling of ECR pull requests from the public registry. I'm somewhat surprised they are throttling pulls from Batch instances, but that is my best guess. I did add a few more permissions to the role used by the instances (SSM-core), but I am not sure whether that made the difference.

If this problem recurs/persists, I think the best option may be to try to reduce the total number of jobs that we send.
This would mean updating/modifying processes to allow them to take sets of samples, rather than only one at a time.
We could revert to expecting modules to take a project at a time, though this might create more unbalanced loads, or add functionality to create lists of values as inputs, using collate() or buffer() to build the sets, probably followed by a `transpose()` operation (see the sketch below). I haven't fully thought this through though!
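As a rough illustration of the channel logic only, a hedged sketch with placeholder names and values (no real processes or modules from the workflow):

```groovy
// Hypothetical sketch: group per-sample values into batches with collate(),
// stand in for a batched process with a map() call, then split the grouped
// outputs back into per-sample pairs with transpose(). All names are placeholders.
workflow {
    Channel
        .of('sampleA', 'sampleB', 'sampleC', 'sampleD', 'sampleE')
        .collate(2)  // -> [sampleA, sampleB], [sampleC, sampleD], [sampleE]
        .map { samples -> [samples, samples.collect { "${it}_result" }] } // batched "process" stand-in
        .transpose() // -> [sampleA, sampleA_result], [sampleB, sampleB_result], ...
        .view()
}
```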

We can reevaluate if/when we next see throttling errors, I expect.

@jashapiro
Member Author

Looking at the actual error on AWS Batch, I see:

CannotPullContainerError: Error response from daemon: unknown: repository 'public.ecr.aws/openscpca/doublet-detection:v0.1.0' bad request (400)

Sadly, we are again hitting the same errors with the latest (staging) run. I think the next move is probably to adjust the workflows to try to batch jobs into larger work units. I expect to be able to limp along and rerun the workflow to completion with the current settings by running the simulated and real data separately, but this should be addressed for the future.

@jashapiro
Member Author

I have been testing this in https://github.com/jashapiro/nf-wave-test and was able to reproduce the behavior yesterday with the same CannotPullContainerError. At the time I was (I think) at the throttles-somteimes tag. I submitted 1000 jobs cleanly in an initial run, but when I submitted more I immediately started to get the pull error.

I continued with a bit more testing, which included some consolidation in preparation for submitting a bug report, and checking whether we would still see the same error using our previous AWS Batch stack. When I started those tests today, I was excited to see that the error was no longer coming up, suggesting the problem was likely our new Batch stack. But when I went back to our new Batch stack, I am no longer getting errors there either, despite submitting thousands of jobs.

I reverted to the tagged version where I was previously getting errors, and I have not been able to recreate them.

At some point I had also updated Nextflow to 24.04.4 (from 24.04.3), and while I thought that could be where the fix happened, when I reverted to the previous version I was still unable to recreate the error with either version of the profiles.

So I am now back to thinking this was a bug at the Seqera end, and perhaps they spotted it in their logs or other usage. It is also possible that the fix was in a plugin that was not re-downloaded when I downgraded, as I don't know exactly how plugins are handled. To account for that eventuality, I upgraded Nextflow on the workload server and will monitor to see whether the issue remains resolved.

In the meantime, I found a few places where we seem to have settings that are no longer required, which I will be submitting separately.

@jashapiro
Member Author

Okay, I got it failing again, at the following commit: 41345f8, though it did initially fail with 4f6ec98 when I was wondering if module-specific binaries might be required.

It just seems that it may take a lot of jobs (or many runs) to induce a failure. I have saved a number of log files now, so it should be possible to start sending out some inquiries along with test results.

@jashapiro
Member Author

Okay, after discussion with the Seqera team on Slack, it does seem that the issue is likely ECR API rate limiting. Luckily, this should be fairly straightforward to solve? According to the AWS docs I can find, we will need an account with the ecr-public:GetAuthorizationToken and sts:GetServiceBearerToken permissions so that it can log in to the public ECR for authenticated pulls, which vastly increases the rate limit. (I had assumed that other ecr-public Get/Describe privileges might also be required, but maybe not, since the registry is public? We can test this.) As far as I can tell, this does not need to be an account with any particular access to other resources, which may be useful, as I believe we do need to be able to provide an access key and secret key; SSO is not supported (though we can use a role, if desired).

Tagging @davidsmejia and @jaclyn-taroni for advice/thoughts on next steps.

@jashapiro
Member Author

We have added credentials to the Seqera account used by the batch workflow to allow login to the public ECR, and that seems to have solved the wave container issue! 🎉

@jashapiro
Member Author

I may have declared victory too soon. The errors are recurring.

@jashapiro jashapiro reopened this Nov 14, 2024
@jashapiro
Member Author

jashapiro commented Nov 14, 2024

I wonder if the better solution may be to add arn:aws:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly to the openscpca-nf roles? (I thought that I had added these permissions to the appropriate roles, but maybe not)

@jashapiro
Member Author

I didn't necessarily want to declare victory here yet...

I'm looking at https://docs.aws.amazon.com/AmazonECR/latest/public/public-service-quotas.html and wondering whether we are even exceeding 10 pulls per second at times, which is the maximum for both "Rate of image pulls to AWS resources" and "Rate of authenticated image pulls". The latter can be raised by requesting a service quota increase, but the former cannot, and it is unclear to me which applies "first." What doesn't fully make sense to me, though, is why retries don't end up falling outside the throttling window; they should be delayed by

sleep(Math.pow(2, task.attempt) * 200 as long) // sleep to allow for transient errors

But maybe we need to increase the delay there?
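For example, assuming that sleep sits inside a dynamic errorStrategy closure (the usual pattern for retry with backoff), a longer delay might look something like this; the 2000 ms base is an illustrative guess rather than a tested value:

```groovy
// Hypothetical sketch of a longer exponential backoff before each retry:
// waits roughly 4 s, 8 s, and 16 s for attempts 1, 2, and 3 before retrying.
process {
    errorStrategy = {
        sleep(Math.pow(2, task.attempt) * 2000 as long) // pause to get past transient throttling
        return 'retry'
    }
    maxRetries = 3
}
```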

We'll see if https://github.com/AlexsLemonade/OpenScPCA-infra/issues/52 does make a difference though.

@jashapiro
Member Author

I am reopening this, as we continue to get errors on large runs.

I think I will have to go back to the Seqera folks with more info. Interestingly, when I look at the usage of the AWS credentials we have stored with them, they are being accessed, but AWS does not seem to record them being used for public ECR access...

@jashapiro jashapiro reopened this Dec 23, 2024
@jashapiro
Member Author

I think the issue may have been how we were storing the credentials on Seqera: I had stored them as AWS credentials, but they may have needed to be stored explicitly as a "Container Registry" with the registry server as public.ecr.aws.

Testing now with the credentials moved to that slot.

@jashapiro
Member Author

Update: I can see that nextflow/wave did access the credentials today! So I think that means they are actually in the right place now and maybe we can really close this! One more set of tests in progress though.

@jashapiro
Member Author

Looks good, closing, hopefully not to reopen.
