Since 2.1.3 / October 2024, many recipes are exiting with Docker exit code 137 (memory exhausted) #325
Comments
What's the size delta in the biggest XML file between the previously successful run and the one now crashing?
It is the same XML, due to openzim/zimfarm#1041.
Oh no, maybe it is not the same XML. Recipes ran on the 7th of May and dumps are dated 15th of May in S3. How do I get the previous size, since the files are gone, and the logs as well?
I feel we should not look for an external cause although one will most likely present itself.
Thank you, I've opened #326.
Which is the RAM-hungry step?
That's preparation.py; it's mostly (purposely) done via other tools started as subprocesses that manipulate the XML files.
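For reference, a minimal sketch (not sotoki's actual code) of how one could sample a subprocess tree's RSS to identify the RAM-hungry step. It assumes psutil is installed, and the command at the bottom is a placeholder, not the real preparation invocation:

```python
# Hypothetical sketch: wrap one of the preparation subprocesses and sample its
# RSS (including children) to find the RAM-hungry step. Requires psutil.
import subprocess
import time

import psutil


def run_and_sample(cmd: list[str], interval: float = 1.0) -> int:
    """Run `cmd`, polling its RSS every `interval` seconds; return peak bytes."""
    proc = subprocess.Popen(cmd)
    ps = psutil.Process(proc.pid)
    peak = 0
    while proc.poll() is None:
        try:
            # include children, since external tools may fork their own workers
            rss = ps.memory_info().rss + sum(
                c.memory_info().rss for c in ps.children(recursive=True)
            )
            peak = max(peak, rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(interval)
    return peak


if __name__ == "__main__":
    peak = run_and_sample(["python", "some_xml_tool.py"])  # placeholder command
    print(f"peak RSS: {peak / 1024**2:.0f} MiB")
```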
Running 3dprinting.stackexchange.com_en with the old 2.1.2 (instead of 2.1.3) succeeds with the old amount of RAM: https://farm.openzim.org/pipeline/395cdbf7-612e-4be9-b514-200f842c76a1/debug That being said, dependencies are not pinned on sotoki, so lots of things might have changed between 2.1.2 and 2.1.3.
I ran or.stackexchange.com_en on my machine and it is very strange. I started by running the 2.1.2 image, with top running from inside the container. While memory usage at the beginning was quite moderate, when the scraper started to process Questions I saw very high memory usage, up to 2.2G at the end of the crawl (or maybe even higher, but I didn't see that):
I then started instrumentation on my machine, and ran 2.1.2 again and then 2.1.3. In both cases, memory usage was very comparable, about 450M at peak. Below, 2.1.2 is in green and 2.1.3 is in blue. The only remark is that 2.1.3 seems to run a little bit faster, but that is probably not a big deal. This means I did not reproduce what I saw the first time I ran 2.1.2 and observed container processes with top. I started 2.1.2 again with top running from inside the container, and I got the same result as the benchmarking graph. I also started the recipe again on the same worker. So it looks like there is some environmental factor in the expression of this issue. I will continue to investigate.
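For what it's worth, here is one way such a memory timeseries can be captured (not necessarily the instrumentation actually used above). It polls `docker stats` and writes a CSV that can be plotted for two runs side by side; the container name "sotoki-test" is a placeholder:

```python
# Sketch: log a container's memory usage over time via `docker stats` so that
# two runs (e.g. 2.1.2 vs 2.1.3) can be plotted and compared.
import csv
import subprocess
import time

CONTAINER = "sotoki-test"  # placeholder container name

with open(f"{CONTAINER}-memory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["elapsed_s", "mem_usage"])
    start = time.time()
    while True:
        out = subprocess.run(
            ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", CONTAINER],
            capture_output=True,
            text=True,
        )
        if out.returncode != 0:  # container exited or was killed (e.g. exit code 137)
            break
        # MemUsage looks like "450MiB / 7.667GiB"; keep only the usage part
        writer.writerow([round(time.time() - start), out.stdout.strip().split(" / ")[0]])
        time.sleep(5)
```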
The problem of the first run leaking memory is reproduced with two successive 2.1.2 runs of windowsphone.stackexchange.com_en on my machine. This confirms that:
Thank you @benoit74; investigating memory leaks with Python is difficult. From my experience, it requires extreme rigor and documentation so that apples can be compared to apples as much as possible. You're lucky you have both working and leaking scenarios in different images. I suggest you bisect the changes and test to find the culprit change(s). I'd start with reverting the dependencies update.
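As a starting point for that bisection, a small sketch that diffs `pip freeze` output captured from the two images (the file names are placeholders; you would dump the freezes from the 2.1.2 and 2.1.3 images first):

```python
# Sketch: compare the pinned packages of two image freezes to see which
# dependencies actually changed between 2.1.2 and 2.1.3.
def load_freeze(path: str) -> dict[str, str]:
    pins = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "==" in line:
                name, version = line.split("==", 1)
                pins[name.lower()] = version
    return pins


old = load_freeze("freeze-2.1.2.txt")  # placeholder file names
new = load_freeze("freeze-2.1.3.txt")

for name in sorted(old.keys() | new.keys()):
    if old.get(name) != new.get(name):
        print(f"{name}: {old.get(name, 'absent')} -> {new.get(name, 'absent')}")
```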
I probably nailed down the problem: in the first run, we download / resize / upload many pictures to the S3 cache. In subsequent runs, we only download the pictures from the cache. I just ran again
So something is leaking memory in this async execution. |
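A minimal sketch of the suspected pattern, not sotoki's actual code: if every image is scheduled up front with nothing throttling the work, the payloads of many in-flight downloads can pile up in memory, whereas a semaphore keeps only a bounded number of images in flight at once. `fetch_resize_upload` is a hypothetical stand-in for the real download / resize / S3 upload step:

```python
# Sketch of bounding concurrency in an async image-processing step.
import asyncio


async def fetch_resize_upload(url: str) -> None:
    await asyncio.sleep(0.01)  # placeholder for the real download, Pillow and S3 work


async def process_images(urls: list[str], max_in_flight: int = 10) -> None:
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(url: str) -> None:
        async with sem:
            await fetch_resize_upload(url)

    # tasks are still created eagerly; only their bodies are throttled, so the
    # per-task overhead remains but large payloads are never all held at once
    await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    asyncio.run(process_images([f"https://example.com/{i}.png" for i in range(1000)]))
```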
What I've found so far:
Not sure there is much to do; maybe this is linked to the move to the "non-slim" Docker image? Or has anything else changed in the environment?
See e.g. https://farm.openzim.org/pipeline/b04c3e6f-ded2-47e7-84f7-bbac8def6a8e and https://farm.openzim.org/pipeline/6e227685-1dbf-4399-90b5-10d73abb81cb and https://farm.openzim.org/pipeline/2b6fdba8-1f72-4802-9b86-13eede679968