Memory limit for pdfalto subprocess in Grobid server not working with docker image #1036
Comments
Hi @sanchay-hai, could you share the whole docker image identification, and the command you used to run it? Did you use […]? The behavior you describe is very strange, let's see what the problem could be. Yesterday, when investigating #1035, I processed about 800 PDF documents twice and did not notice any memory leak, nor any OOM crashes. Finally, do you by any chance have the log file somewhere? I'd be interested in its latest part. |
The docker images I tried were […]. Oh sorry, I did not try […]. Thanks for the quick response! |
@lfoppiano do you have any recommendations on how to decrease the memory usage of the server container? Would decreasing batch_size (from 1000) help, or decreasing concurrency (from 10)? |
Does this work?
You could perhaps use the same batch_size as the number of concurrent jobs allowed by the server. Please check here: https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration |
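For reference, a minimal sketch of the server-side knob, assuming a recent grobid.yaml layout (key names and defaults may differ slightly between versions; the linked configuration page is authoritative):

```yaml
# Illustrative grobid.yaml excerpt, not the actual configuration from this thread
grobid:
  # maximum number of documents the service processes in parallel
  concurrency: 10
  # seconds an incoming request waits for a free engine before being rejected
  poolMaxWait: 1
```

If you use the Python client, its batch_size (1000 by default in its config.json) only controls how many files the client queues per batch; actual server-side parallelism is bounded by concurrency.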
Still not sure if […]. I used a beefier machine and gave the container 50GB, but the server container still crashed after a while with OOM. |
Let me try with batch_size=10 |
I'm still not sure I can picture everything you've done (settings, memory and so on). Meanwhile, could you share the logs? I suppose you can see them in the docker console? |
So here is what I'm trying at the moment: […]
Seems to be holding up, but CPU usage has dropped significantly. Attaching a snippet of the logs: grobid_server_logs_snippet.txt |
Ok, this server also crashed with OOM... here are the logs from between when it was killed and when it was restarted at […]
|
So to me it does seem like there is a potential memory leak. Let me know if you have any clues, or want me to try something |
Mmmm... I'm not sure what the issue is here. I don't see any errors or exceptions. When the server crashes, does Docker restart it automatically? Do you have the OOM exception? Also, in the log I saw this: […]
which seems to be a protected PDF document. Could you try to find it again? Maybe this is what is causing the problems? How many CPUs do you have allocated to the docker container? Could you also share the command line you use to run the docker image? |
Yes, I have --restart
No, unfortunately I haven't seen an OOM exception in the logs
Could this lead to a memory leak? How so? I can try to find it if you think this can cause a memory leak
16 vCPUs
|
I think I may have figured out the reason for the OOM kills: the pdfalto processes consume a lot of memory. So if the main grobid-service process has an Xmx of 48G, it will keep growing and growing, and then when the pdfalto processes need memory, there isn't much room left. This causes the entire container to get OOM-killed, and may also be why we don't see an OOM stack trace in the main grobid-service process. Unfortunately, I tried with […]. Any ideas on how to reduce the memory consumption of the pdfalto processes? |
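One way to confirm that it is really the pdfalto children (rather than the JVM heap) driving the container's memory is to look at per-process resident memory. A minimal sketch, assuming the container is named grobid and ships a standard procps ps:

```bash
# Container-level memory/CPU view from the host
docker stats --no-stream grobid

# Inside the container: heaviest processes by resident set size (RSS, in kB)
docker exec grobid sh -c 'ps -eo pid,rss,etime,cmd --sort=-rss | head -n 10'
```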
@sanchay-hai OK, in order to investigate on my side, I need a sample of your PDF documents. |
Hi @sanchay-hai, there is normally no memory leak with older versions of Grobid, because I tested the server with millions of PDFs over several days with JProfiler specifically for this. However, I did not test more recent versions, nor the server running as a docker container, because all these tests are time consuming. I don't see anything particular added in the last 2 years that could lead to memory leaks. First, you can have a look at this about setting up your server: https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration You should not allocate memory to the JVM like this: the server needs free memory for other things like pdfalto and JNI, and more generally Java uses extra memory outside this allocation. Most of the memory usage of Grobid is outside the JVM (machine learning components and PDF parsing with pdfalto). Basically you should not touch […].
These are normal warnings from pdfalto, including the protected PDF (it might be parsed if I remember well, but you are normally not allowed to), nothing unusual. In production, use […]. About the client: if you have 50GB of memory, I don't think you should run into any problems with pdfalto, unless you are parsing a lot of exceptionally long and large PDFs with a high concurrency on the server. Scholarly PDFs normally do not have more than 1000 pages, so this is usually not a problem. In case the problem is pdfalto memory usage (possible if you have mega-PDF files, but these are normally not supported by Grobid and not so frequent in the scholarly world), there are two parameters to control it (see https://grobid.readthedocs.io/en/latest/Configuration/#pdf-processing).
These 2 parameters are used to protect the server, as circuit breakers. There are unfortunately a lot of problematic PDFs around, and parsing PDFs in an external process protects the JVM from crashing in scenarios with millions of PDFs to process. For me, these default parameters worked fine when processing very large numbers of PDFs. |
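For illustration, this is roughly where those two circuit breakers live in grobid.yaml; the exact key names and defaults can vary between versions, so check the PDF-processing page linked above:

```yaml
# Illustrative excerpt, not authoritative defaults
grobid:
  pdf:
    pdfalto:
      # cap on the memory a single pdfalto run may use (in MB)
      memoryLimitMb: 6096
      # a pdfalto run exceeding this many seconds is killed
      timeoutSec: 60
```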
I agree there may not be a memory leak. It is just that the pdfalto processes consume a lot of memory (maybe on large pdfs).
So I started without setting these at all, but that caused the machines to basically become unresponsive after a while, I guess because they started swapping and thrashing badly. So I would prefer to give the container a max memory and have it killed and restarted when it hits it. Xmx is also helpful IMO, because otherwise the JVM will keep consuming more and more memory: I've noticed that with Xmx48G, the grobid-service process itself grew to 35GB+, leaving even less memory for the pdfalto processes.
Cool. I do notice OOM crashes more often with a higher batch_size, but then throughput is reduced too.
This helps a lot. So with concurrency: 10, the worst case is up to 60GB of pdfalto memory. If I give grobid-service 12GB, that means 72GB should suffice with the default parameters. Let me try allocating enough memory. Will keep you posted. Thanks a lot, I really appreciate the detailed responses!! |
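As a rough illustration of that budget turned into a container limit (the figures, container name, and image tag are placeholders, with some headroom left for the OS and native allocations):

```bash
# 10 concurrent pdfalto runs x 6GB cap = 60GB worst case
# + ~12GB JVM heap for grobid-service  = ~72GB, so cap the container slightly above that
docker run -d --name grobid \
  --memory 80g --cpus 16 \
  -p 8070:8070 \
  grobid/grobid-crf:0.8.0-SNAPSHOT
```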
That didn't work either. I tried […]. |
Good question, I've never seen anything like this happen so far - actually I've never needed to use Grobid on a machine with so much memory, even when running with n=24 :) To reproduce the problem, would it be possible to give me some more information about the kind of PDFs you are processing (normal scholarly article PDFs, or large PhD theses/books?) and how many PDFs are processed before reaching this amount of memory? |
Here is the zip file of the PDFs that caused the above OOM (download link is valid for 12h). I was not able to reproduce the issue on my laptop: the max memory usage there was 19GB with --n 10 and 20GB with --n 20. |
@sanchay-hai thanks for the files. Meanwhile, thanks to the favorable time zone, I tested my instance with your files. I processed them all in 12 minutes, without any OOM or any other issue. You can see the log of the grobid client here: […]
As a side note: the data under […]. I made a test with […]. My grobid service had already been running for 2-3 weeks and I've been using it for processing, so it's not a fresh container. |
Thank you, I also tried and was unable to reproduce it on my laptop. One thing to try for reproducing it might be to run the same zip in a loop, maybe shuffling the PDF list each time. |
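A sketch of such a loop, assuming the Python grobid_client CLI is installed and pointed at the server through its config.json; paths, pass count, and concurrency are placeholders:

```bash
# Re-process the same set of PDFs repeatedly to see whether server memory keeps growing
for i in $(seq 1 20); do
  echo "=== pass $i ==="
  mkdir -p "./out/pass-$i"
  grobid_client --input ./pdfs --output "./out/pass-$i" --n 10 processFulltextDocument
done
```

Shuffling the file order would additionally require re-staging the input directory between passes.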
I was too slow, the token has expired! |
Here it is again. Feel free to download the zip. |
Thank you @sanchay-hai, I got it! |
Ok, I did a little more debugging. In a couple of instances I checked that when the container went OOM, it was one pdfalto process hogging all the memory (e.g. growing above 90GB). I know you said that pdfalto is limited to 6GB (and I see the code here, is that the right place?), but I'm not sure the ulimit is actually getting applied. For example, I see this in the docker container (notice that the "Max address space" is unlimited): […] Any ideas?
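One way to verify this on a running container is to read the limits of a live pdfalto child directly from /proc. A sketch, assuming pgrep is available in the image and the container is named grobid:

```bash
# Grab the newest pdfalto PID and print its address-space / resident-set limits
docker exec grobid sh -c '
  pid=$(pgrep -n pdfalto) &&
  grep -E "Max address space|Max resident set" "/proc/$pid/limits"
'
```

If the ulimit set by the launcher were actually inherited, "Max address space" would show the configured cap rather than unlimited.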
|
@sanchay-hai Thanks a lot for this, I think you're totally right: this is the problem, the memory limit for the pdfalto process is not applied in this case. I need to find the time to dive into this and explore a fix. In principle, in server mode, we use the following shell script to better control the child process and its termination: […]
@sanchay-hai I've given some thought to your problem: […]
|
So it appears that the memory limit mechanism was not set in server mode. I pushed a fix with PR #1038 (adding the ulimit in the pdfalto_server script as mentioned above), and from my tests it now works correctly with both the java server and the docker server. Setting the pdfalto memory limit to something very low in the Grobid config file results, as expected, in the subprocess failing on large PDFs. @sanchay-hai could you maybe test your set of PDFs again with the following updated docker image: docker pull grobid/grobid-crf:0.8.0-SNAPSHOT |
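For readers following along, the general shape of such a fix (a sketch only, not the literal content of PR #1038 or of the pdfalto_server script) is to apply the ulimit in the same shell that execs pdfalto, so the child process inherits it:

```bash
#!/bin/sh
# Sketch of a launcher enforcing a per-process memory cap before exec'ing pdfalto.
# MEMORY_LIMIT_MB and the pdfalto path/arguments are placeholders.
MEMORY_LIMIT_MB="${MEMORY_LIMIT_MB:-6096}"

# ulimit -v takes kilobytes; -S sets the soft limit for this shell and its children
ulimit -Sv $(( MEMORY_LIMIT_MB * 1024 ))

exec /opt/grobid/pdfalto "$@"
```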
I've updated the docker image: |
Sorry, I've been busy with other things. Thank you for fixing the issue, feel free to close it. I will get back on this thread if it doesn't work for me, next time I manage to test. |
No problem @sanchay-hai, thanks a lot for this issue! We normally let the person who opened the issue close it at their convenience. |
What is your OS and architecture? Windows is not supported and macOS arm64 is not yet supported; for non-supported OSes you can use Docker (https://grobid.readthedocs.io/en/latest/Grobid-docker/) ---- Amazon Linux
What is your Java version (java --version)? ---- Reproduced using the docker images 0.7.3 and 0.7.2.
So we have a long-running Grobid server process that we run with Xmx18G. What we notice is that processing one batch of ~1000 PDFs consumes 7-10GB, but then processing the 2nd batch of ~1000 PDFs consumes another 7-10GB, and eventually the server gets killed with OOM.
This is a consistent pattern: the server keeps consuming more and more memory and needs to be restarted. Is there possibly a memory leak? Are there any knobs or workarounds we can play with?