s3 cp "Cannot allocate memory" error #5876
Comments
Hi @tthyer 👋🏻 I'm setting up an environment to replicate. I have a couple of clarifying questions.
I will try to reproduce, but if you know the answer to either of these it would help:
|
Hi @kdaily! I am running Cromwell on Batch, and am using the infrastructure provided by AWS Genomics. I've been in communication with them through the cromwell slack channel. They suggested that I provide additional RAM to my batch job, but I don't think that's going to help as I'm supplying a good deal to this job already, and sometimes this job completes without throwing the allocation error. To answer your questions, first set:
Second set:
What I've currently got in flight is just to run the cp commands with I was wondering in particular whether there are some config options I should try? |
Another data point: I had assumed that the batch compute environments were defaulting to Amazon Linux 2, but that wasn't the case; I was still getting version 1. I've updated the compute environment to use AL2. |
I'm testing this with the defaults as well (AL1), on both my own stack and the AWS Genomics stack. I just downloaded a 50GB file to S3 and triggered some jobs to try to reproduce. I suggested v1 because v2 is built using PyInstaller and uses its own Python executable, which has caused some other hard-to-debug issues, so I was curious whether I could rule that out. Some of the configuration options (namely the I know there are a few places where memory can be configured in compute environments and job definitions - is there enough memory allocated at all levels (the instance type selected as well as the memory made available to the container)? |
Thanks for the background on the awscli versions. |
I used the The number of concurrent requests could be an issue. By default, the S3 client uses 10 concurrent requests (and thus 10 threads). Altering this parameter inside a Batch job (inside a container) is a bit involved, but I'm interested if I can reproduce by artificially increasing the number of requests significantly. Given the sporadic nature of this issue, I'm afraid that getting debug logs of a failure may be the only recourse. Failing that, checking |
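For anyone who wants to experiment with the concurrency setting mentioned above, here is a minimal sketch of one way to generate the relevant section of the AWS CLI config file. `max_concurrent_requests` is the CLI's S3 setting with a default of 10; the helper name `render_s3_config` and the value 4 are hypothetical, chosen only for illustration, and the thread does not establish that lowering this value prevents the error.

```python
def render_s3_config(max_concurrent_requests: int, profile: str = "default") -> str:
    """Render an AWS CLI config-file section that sets S3 transfer concurrency.

    Hypothetical helper: it only builds text, it does not touch ~/.aws/config.
    """
    if max_concurrent_requests < 1:
        raise ValueError("max_concurrent_requests must be at least 1")
    # Named profiles use a "[profile NAME]" header; the default profile does not.
    header = "[default]" if profile == "default" else f"[profile {profile}]"
    # Nested S3 settings are indented under an "s3 =" line in the config file.
    return "\n".join([
        header,
        "s3 =",
        f"  max_concurrent_requests = {max_concurrent_requests}",
        "",
    ])


# 4 is an arbitrary illustrative value, not a recommendation from this thread.
config_text = render_s3_config(4)
print(config_text)
```

Inside a container, the rendered text would need to end up in the config file that the CLI reads at job start, which is the "a bit involved" part noted above.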
I'm going to reach out to the AWS Genomics team as well to see if anything else is known on their side. |
@henriqueribeiro, I've only conducted a few workflow runs since upgrading, and the bug has not recurred yet, but it was very intermittent for us. I'll keep you posted. I opened an issue in the aws-genomics-workflows repository (see just above) for us to discuss different environmental changes we're making in that infrastructure. |
@tthyer what version of Cromwell are you using? |
@kdaily I can confirm that it is happening with Cromwell 55. |
I'm also using 55 |
Thanks. I'm in contact with AWS Genomics, and will update once I hear more. |
It's possible that if the scheduling of jobs ends up with many on the same host instance, there could be memory issues related to how containers use memory and how Python does or does not respect those limits: From https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory:
My AWS Genomics contact also noted that container 'swappiness' can play a role. I came across this (external) post regarding large disk writes and memory issues in Docker containers that might shed some light: https://codefresh.io/docker-tutorial/docker-memory-usage/. I also came across this memory issue in containers for the V2 client: From that, we can gather that PyInstaller (the tool used to build the AWS CLI V2 bundle) makes some choices where to write to, including possibly to Long story short, there might be an underlying issue with Python and large disk writes on containers. I don't have a good short term solution though at this time. I'm going to add a new issue to investigate memory usage of the V2 client in containers. I'll mark this as |
Sure! Can you sanitize/redact anything like account numbers and upload it here? |
Here it is: log-events-viewer-result.log I'm not sure whether there is anything interesting in there. If you have any suggestions for debugging this problem, please tell me and I can run some more workflows. Also, I set the |
Thanks for the logs, and for lowering the Does this reproduce for you regularly? If it does, this may be a big ask depending on how your environment is configured, but can you run this with AWS CLI v1 instead of v2? If I can rule out anything related to the PyInstaller bundle (or confirm it), that would be great! |
Hi @henriqueribeiro, ok, good to know that it occurs in AWS CLI version 1. I would note that this version is almost three years old and uses Python 2, which will soon no longer be supported (July 2021). However, given that the same issue occurs, I don't think it's the underlying cause. What are the units on the y-axis? |
Ahh sorry, I missed that. |
After reverting to AWS CLI 2, I added some swap space to the EC2 instances, and after running the workflow twice I no longer got the memory allocation error. I also noticed that the swap space was being used, so it seems to be working. |
Thanks for the update @henriqueribeiro - I'll check on that. I think that still points to a container issue, reviewing this: https://docs.docker.com/config/containers/resource_constraints/ Can you verify what the memory settings are for the container, and can you change them? Wondering if memory swappiness of the container is causing this. |
The memory value depends on the task that will run. It can be changed on the task definition. |
Thanks @henriqueribeiro - still looking into this. I don't think that this is likely an interplay between how much memory the CLI is using and what the memory configuration on the containers are. I'm going to mark this as a bug so that we can investigate it further. In my test of downloading a 50GB file, the CLI was peaking at about 380MB of memory usage. |
@henriqueribeiro I'm having this exact problem as well. How did you add swap to your docker container (" I added some swap space to the EC2 instances")? Did this fix your issue? |
@pjongeneel - I've been looking into this a bit more. @henriqueribeiro was using AWS CLI v1 with Python 2 - can you try with Python 3 (we've since dropped support in new versions of the CLI and Python SDK for Python 2). Thanks! |
I have just built a brand new batch cluster and have started getting All containers:
I appreciate this is not much to go on, but we have just started experiencing this issue and are currently trying to work out what is happening. I will add more information as I come across it.

EDIT: I also appreciate our problem does not seem to be specifically aws-cli related, so I will not add any further clutter to this thread.

EDIT: I think I got to the bottom of our problem. Our Batch cluster was configured with an additional EBS volume formatted with xfs. This was our Docker root. The Docker containers were configured to create an anonymous volume in this EBS volume. After a bit of reading we came across this issue: docker/for-linux#651

Apparently cgroups writeback does not support the xfs file system. Thus, write operations incorrectly calculate the available dirty_background_bytes based on the total available system memory, not the memory assigned to the container. This causes OOM errors and memory allocation errors on seemingly innocuous tasks. Since we changed the filesystem on this EBS volume, the above errors have gone away.

Our problem still might not be related to this issue, but if people are experiencing memory issues on Batch tasks with heavy write operations, it might be worth checking your underlying filesystem. |
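To make "checking your underlying filesystem" concrete, the following is a rough sketch of one way to find the filesystem type behind a given path by matching it against `/proc/mounts`-style text. The function name `fs_type_for`, the sample mount table, and the `/docker-root` path are hypothetical; the real Docker root on a given host may live elsewhere.

```python
import os
from typing import Optional


def fs_type_for(path: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    device, mount point, filesystem type, options, ...
    """
    target = os.path.normpath(path) + "/"
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest matching mount point; later entries win ties
        # because a later mount shadows an earlier one at the same path.
        if target.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


# Hypothetical mount table resembling the setup described above.
SAMPLE_MOUNTS = """\
/dev/xvda1 / ext4 rw,noatime 0 0
/dev/xvdb /docker-root xfs rw,noatime 0 0
"""

docker_root_fs = fs_type_for("/docker-root/volumes", SAMPLE_MOUNTS)
scratch_fs = fs_type_for("/tmp/scratch", SAMPLE_MOUNTS)
print(docker_root_fs, scratch_fs)
```

On a real host, the second argument would be the contents of `/proc/mounts`, read on the instance itself rather than inside the job container.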
Thanks for your research, @microbioticajon! I'm investigating how this relates to the EKS/Kubernetes issues mentioned there as well. |
Looks like I might have spoken too soon. I'm no longer getting OOM errors and cat+redirect seems happy; however, I'm still getting the odd memory allocation error thrown by aws s3 cp under load: AWS CLI v2.2.26 Docker info on the worker node yields: |
I think I may have resolved our awscli cp memory allocation error on our Batch cluster. By default, the Batch/ECS-optimised AMI does not provision any swap. The Batch docs suggest that the default job definition values for the linuxParameters section are swappiness:60 and maxSwap:2x_allocated_memory, which seems contradictory. I tried setting swappiness:0 maxSwap:0 within the job definition, but awscli cp memory allocation errors still occurred. I added a small swap partition to the node launch_template and initialised the volume in the node user_data, and we are no longer getting memory allocation errors from awscli. I'm also limiting maxSwap to 500MB in the job definitions in case it tries to use more swap than is available on large-memory jobs. Sorry I cannot be more specific, but perhaps this information is useful to someone more knowledgeable than myself. |
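As an illustration of the job-definition settings described above, here is a small sketch that builds the `linuxParameters` fragment of a Batch job definition's container properties. `maxSwap` (in MiB) and `swappiness` (0 to 100) are the Batch field names; the helper `linux_parameters` is hypothetical, and the value 500 simply mirrors the limit mentioned in the comment.

```python
import json


def linux_parameters(max_swap_mib: int, swappiness: int = 60) -> dict:
    """Build the linuxParameters fragment for a Batch job definition.

    Hypothetical helper: the result is meant to be merged into the
    containerProperties of a job definition, not submitted on its own.
    """
    if max_swap_mib < 0:
        raise ValueError("maxSwap cannot be negative")
    if not 0 <= swappiness <= 100:
        raise ValueError("swappiness must be between 0 and 100")
    return {
        "linuxParameters": {
            "maxSwap": max_swap_mib,   # swap limit for the container, in MiB
            "swappiness": swappiness,  # 0 avoids swapping unless required
        }
    }


# 500 mirrors the cap described above; swappiness keeps the documented default.
fragment = linux_parameters(max_swap_mib=500)
print(json.dumps(fragment, indent=2))
```

As the comment notes, these settings only have something to work with if the instance itself has swap provisioned.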
Thanks for that comment @microbioticajon! We experienced a similar issue with Kubernetes, and this had to do with how memory usage was being reported. It was considering cached memory as counting to total memory usage. The operating system was keeping the full size of a file downloaded from S3 in the cache, so some memory reporting was adding this to the total even though that memory could be freed for use at any moment by the operating system. I think we have enough evidence to close this out based on your Batch/ECS experience. |
|
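The point about cached memory being counted toward total usage can be sketched with a little arithmetic over a cgroup v1 `memory.stat` file. This is a minimal sketch under stated assumptions: subtracting `inactive_file` from the reported usage is one common convention for estimating memory that cannot be cheaply reclaimed, the function names are hypothetical, and the sample numbers are invented rather than taken from any log in this thread.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse 'key value' lines from a cgroup v1 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def usage_without_inactive_cache(usage_bytes: int, stats: dict) -> int:
    """Estimate usage after discounting file cache the kernel can drop."""
    reclaimable = stats.get("inactive_file", 0)
    return max(usage_bytes - reclaimable, 0)


# Invented numbers: a container that has just written a large file.
SAMPLE_STAT = """\
cache 4500000000
rss 400000000
inactive_file 4200000000
active_file 300000000
"""
reported_usage = 5_000_000_000  # what memory.usage_in_bytes might report

stats = parse_memory_stat(SAMPLE_STAT)
estimate = usage_without_inactive_cache(reported_usage, stats)
print(reported_usage, estimate)
```

With these invented figures the reported usage is close to a 5GB limit, while the estimate after discounting inactive cache is well below it, which is the kind of gap described above.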
Hi @kdaily Many thanks for the insight. Since adding a swap partition to the Batch node the number of aws s3 cp related memory allocation errors has dropped off considerably. However, in the last week we experienced a few failures when running on large cluster nodes - possibly too many jobs exceeding the maximum swap available? The problem with adding a simple swap volume is that it is configured as part of the launch template and does not scale with either the number of jobs or the size of the node. We have just changed the maxSwap configuration to match the requested job memory. According to the Docker run docs this should turn off swap usage altogether: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details
So far we have had no failures under load, so it is no worse than adding a fixed swap volume at least (I will update if that changes). So perhaps to summarise solutions from this thread:
|
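The Docker rule referenced above can be written out as a small function. This sketch encodes the documented behaviour of `docker run --memory` and `--memory-swap` only; the name `swap_allowance` is hypothetical. How Batch's `maxSwap` is translated into `--memory-swap` is not shown in this thread, so it is worth confirming in the Batch documentation before relying on the "match the job memory" setting.

```python
from typing import Optional


def swap_allowance(memory: int, memory_swap: Optional[int]) -> Optional[int]:
    """Return how much swap a container may use under Docker's documented rules.

    `memory` and `memory_swap` use the same unit (MiB here).
    `memory_swap` is the TOTAL of memory plus swap; None means the flag is unset.
    A return value of None means unlimited swap.
    """
    if memory <= 0:
        raise ValueError("memory must be a positive integer")
    if memory_swap is None or memory_swap == 0:
        # Unset (0 is treated as unset): swap up to the --memory value,
        # provided the host has swap configured.
        return memory
    if memory_swap == -1:
        return None
    if memory_swap < memory:
        raise ValueError("memory_swap cannot be smaller than memory")
    return memory_swap - memory


# 7168 MiB is an illustrative stand-in for a 7GB job.
no_swap = swap_allowance(7168, 7168)
default_swap = swap_allowance(7168, None)
print(no_swap, default_swap)
```

Setting the two flags to the same value yields zero swap, which is the case the Docker docs describe as turning swap off for the container.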
EDIT: I was able to solve this, it was a kernel issue, see my last edit I'm currently debugging out of memory problems and I think it is related to your issue @microbioticajon, I would be interested to know if you found something in the last year. The host has around 400GB RAM and no swap. Several Docker containers run with memory limited to around 5GB (not sure if they have swap since options
Interestingly, it runs out of RAM when writing to disk (e.g. with
I'm thinking that there is something strange that involves Docker, cgroups, cache pages, writeback, EBS (Elastic Block Store), where the Linux kernel or Docker gets confused and goes out of memory instead of using free memory or freeing cache pages. I'm still learning all of these things. I will update if I find something.

PS: This looks related and also this

Edit 2023-04-14: I was able to reproduce my problem more consistently in the case where
Now I used an EC2 instance with 60GB RAM, and ran on it 3 Docker containers with
It happened with all the kernels I tried (4.14.309-231.529.amzn2.x86_64, 4.14.248-189.473.amzn2.x86_64 and 5.15.104-63.140.amzn2.x86_64). Sometimes it failed while doing
Apparently this is not a problem of the AWS CLI, or of Boto3. It is a problem where Docker containers limited in memory can fail even if the "standard" RAM usage is low. Apparently, when heavy writing is done, huge page caches are created in RAM, and that goes over the limit imposed on the Docker containers. Normally, the Linux kernel reduces these page caches and no problems happen, but I think there are issues with the Linux kernel used in EC2 or with AWS Elastic Block Store.

Edit 2023-10-19: In the end, there is an issue with some kernels of Amazon Linux: when a Docker/cgroup writes to disk, the write buffer/page cache size increases and can fill the RAM allocated to the Docker/cgroup, and the process is killed or the
The solution I used is to add swap to the host (not necessarily to the Docker/cgroup). With a few MB of swap, the Linux kernel will use it and the error won't happen. I also upgraded the kernel. Amazon support also confirmed that at least these kernels have the problem: ‘4.14.322-246.539.amzn2’, ‘5.10.192-183.736.amzn2’, ‘5.15.128-81.144.amzn2’. In any case, I observed that the issue happens only for containers with allocated RAM below a certain threshold (e.g. only for Dockers/cgroups with less than 900MB RAM), and that upgrading the kernel lowers this threshold (e.g. the bug happens only for containers with less than 100MB RAM). Therefore I recommend upgrading the kernel as much as possible. |
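Since several comments come down to whether the host has any swap and how much page cache is in flight, here is a rough diagnostic sketch that reads those figures from `/proc/meminfo`-style text. The function name `parse_meminfo` and the sample values are hypothetical; the sample is shaped like a large host with no swap, not copied from any machine in this thread.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field: kilobytes} mapping."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


# Invented values resembling a large host without swap.
SAMPLE_MEMINFO = """\
MemTotal:       400000000 kB
Cached:         120000000 kB
Dirty:            2500000 kB
Writeback:         300000 kB
SwapTotal:              0 kB
SwapFree:               0 kB
"""

meminfo = parse_meminfo(SAMPLE_MEMINFO)
host_has_swap = meminfo.get("SwapTotal", 0) > 0
pending_writes_kb = meminfo.get("Dirty", 0) + meminfo.get("Writeback", 0)
print(host_has_swap, pending_writes_kb)
```

On a real instance, the input would be the contents of `/proc/meminfo` read on the host, sampled while a heavy write is in progress.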
Platform/OS/Hardware/Device
What are you running the cli on?
ECS via Batch. awscliv2 is installed via a launch template.
Describe the question
Intermittently, I get the following error when trying to download a large file (~45-50GB):
download failed...[Errno 12] Cannot allocate memory
as part of a workflow of batch jobs. This is occurring for batch jobs that each have >=3GB of memory specified; the last time this occurred, the batch job had 7GB of memory allocated. The command being executed looks something like
/usr/local/aws-cli/v2/current/bin/aws s3 cp --no-progress s3://my-s3-bucket/etc/etc/1000.unmapped.unmerged.bam /tmp/scratch/my-s3-bucket/etc/etc/1000.unmapped.unmerged.bam
Is the python subprocess causing this? What do you recommend to avoid this while running on AWS Batch/ECS?
Logs/output
There are no more informative logs atm -- I will put debugging in so that the next time this happens the debug flag is passed.
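A minimal sketch of how the debug flag could be added and the failure recognised, assuming the job wraps the CLI call in Python. `--debug` is the AWS CLI's global debug option; the helper names, bucket name, and local path are hypothetical placeholders, and nothing here runs the command.

```python
import errno


def build_cp_command(source: str, destination: str, debug: bool = False) -> list:
    """Build the argument list for an `aws s3 cp` call (does not execute it)."""
    command = ["aws", "s3", "cp", "--no-progress"]
    if debug:
        command.append("--debug")  # verbose logs for the next failure
    command += [source, destination]
    return command


def is_enomem_failure(stderr_text: str) -> bool:
    """Return True if CLI output contains the ENOMEM marker, e.g. '[Errno 12]'."""
    return f"[Errno {errno.ENOMEM}]" in stderr_text


# Placeholder bucket and path, not the ones from the report above.
cmd = build_cp_command("s3://example-bucket/sample.bam", "/tmp/scratch/sample.bam", debug=True)
failed = is_enomem_failure("download failed...[Errno 12] Cannot allocate memory")
print(cmd, failed)
```

The argument list can be handed to `subprocess.run(cmd, capture_output=True, text=True)`, with the captured stderr passed to `is_enomem_failure` to flag this specific failure.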