
Since 2.1.3 / October 2024, many recipes are exiting with Docker exit code 137 (memory exhausted) #325

Open
benoit74 opened this issue Oct 31, 2024 · 13 comments

Comments

@benoit74
Collaborator

Not sure there is much to do; maybe this is linked to the move to the "non-slim" Docker image? Or did anything else change in the environment?

See e.g. https://farm.openzim.org/pipeline/b04c3e6f-ded2-47e7-84f7-bbac8def6a8e and https://farm.openzim.org/pipeline/6e227685-1dbf-4399-90b5-10d73abb81cb and https://farm.openzim.org/pipeline/2b6fdba8-1f72-4802-9b86-13eede679968
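For context, exit code 137 is 128 + 9, i.e. the container's main process was killed with SIGKILL, which is typically the kernel OOM killer acting when the container hits its memory limit. A minimal sketch of how to confirm that on a worker, assuming the docker Python SDK is installed and the stopped container has not been removed yet (the container name is a placeholder):

    # Sketch: confirm that an exit code 137 really was an OOM kill.
    # Assumes docker-py is installed; the container name is hypothetical.
    import docker

    client = docker.from_env()
    container = client.containers.get("sotoki_task")  # placeholder name

    state = container.attrs["State"]
    print("ExitCode:", state["ExitCode"])    # 137 == 128 + SIGKILL (9)
    print("OOMKilled:", state["OOMKilled"])  # True when the OOM killer fired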

@rgaudin
Member

rgaudin commented Nov 1, 2024

What's the size delta in the biggest XML file between the previously successful run and the now-crashing one?

@benoit74
Collaborator Author

benoit74 commented Nov 1, 2024

It is the same XML due to openzim/zimfarm#1041

@benoit74
Collaborator Author

benoit74 commented Nov 1, 2024

Oh no, maybe it is not the same XML. Recipes ran on the 7th of May and dumps are dated the 15th of May in S3. How do I get the previous size, since the files are gone, and the logs as well?

@rgaudin
Member

rgaudin commented Nov 1, 2024

I feel we should not look for an external cause although one will most likely present itself.

  • It stopped at 27%, during questions processing.
  • The RAM-hungry step was passed already.
  • There are only 5738 questions.
  • This points towards a leak somewhere.
  • I see many 429 Client Error: Too Many Requests for url: https://i.sstatic.net in the logs. A ticket should probably be opened about that. Exceptions in threads can lead to terrible consequences, so that is probably an area to check (see the sketch after this list).
  • If the base image changed, running with the previous image should be tested early to rule it out.
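On the point about exceptions in threads, here is a minimal Python sketch (illustrative only, not sotoki's actual executor code) of how errors such as those 429 responses can stay invisible: with a thread pool, an exception raised in a worker is stored on its Future and only surfaces when result() is called.

    # Sketch (hypothetical, not sotoki's executor): exceptions raised in worker
    # threads are captured on the Future and stay silent until result() is
    # called, so failed downloads can pass unnoticed.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def fetch_image(url):
        # stand-in for the real download/resize step
        raise RuntimeError(f"429 Too Many Requests: {url}")

    urls = [f"https://i.sstatic.net/img{i}.png" for i in range(5)]

    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(fetch_image, url): url for url in urls}
        for future in as_completed(futures):
            try:
                future.result()  # without this call, the error is never reported
            except Exception as exc:
                print(f"{futures[future]} failed: {exc}")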

@benoit74
Collaborator Author

benoit74 commented Nov 1, 2024

Thank you, I've opened #326

@benoit74
Collaborator Author

benoit74 commented Nov 1, 2024

RAM-hungry step was passed already

Which is the RAM-hungry step?

@rgaudin
Member

rgaudin commented Nov 1, 2024

That's preparation.py; it's mostly (purposely) done via other tools started as subprocesses, manipulating the XML files.
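To illustrate the pattern only (the command and paths below are placeholders, not what preparation.py actually runs): the heavy lifting is delegated to external tools spawned as subprocesses, so the large XML working set lives outside the Python process.

    # Sketch of the pattern; the tool name and paths are placeholders, not the
    # real preparation.py commands.
    import subprocess

    def split_dump(dump_path: str, output_dir: str) -> None:
        # The external tool owns the large XML working set; the Python process
        # only waits for it and checks the return code.
        subprocess.run(
            ["some-xml-splitter", dump_path, "--out", output_dir],
            check=True,
        )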

@benoit74
Collaborator Author

benoit74 commented Nov 1, 2024

Running 3dprinting.stackexchange.com_en with the old 2.1.2 (instead of 2.1.3) succeeds with the old amount of RAM: https://farm.openzim.org/pipeline/395cdbf7-612e-4be9-b514-200f842c76a1/debug

That being said, dependencies are not pinned on sotoki, so lots of things might have changed between 2.1.2 and 2.1.3.
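One way to narrow that down (a sketch, assuming both image tags can still be pulled and the docker Python SDK is available; the image names are assumptions) is to dump the installed package versions from each image and diff them:

    # Sketch: diff the Python packages installed in the two image tags.
    # Assumes docker-py is installed; the image names are assumptions.
    import difflib
    import docker

    client = docker.from_env()

    def frozen(image: str) -> list[str]:
        # run `pip freeze` inside the image and return one line per package
        output = client.containers.run(image, "pip freeze", remove=True)
        return sorted(output.decode().splitlines())

    old = frozen("ghcr.io/openzim/sotoki:2.1.2")
    new = frozen("ghcr.io/openzim/sotoki:2.1.3")
    print("\n".join(difflib.unified_diff(old, new, "2.1.2", "2.1.3", lineterm="")))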

@benoit74
Collaborator Author

benoit74 commented Nov 4, 2024

I've run or.stackexchange.com_en on my machine and it is very strange.

I started by running 2.1.2 image, with top running from inside the container.

While memory usage at the beginning was quite moderate, when the scraper started to process Questions I saw very high memory usage, up to 2.2G at the end of the crawl (or maybe even higher, but I didn't see that):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    356 root      20   0    6.2g   2.2g   0.0g S   0.9   3.5   4:01.28 sotoki
      1 root      20   0    0.0g   0.0g   0.0g S   0.0   0.0   0:00.02 bash
      9 root      20   0    0.1g   0.0g   0.0g S   0.0   0.0   0:01.30 redis-server
    364 root      20   0    0.0g   0.0g   0.0g R   0.0   0.0   0:00.04 top

I then started instrumentation on my machine, and ran 2.1.2 again and then 2.1.3. In both cases, the memory usage was very comparable, about 450M at peak. Below, 2.1.2 is in green and 2.1.3 is in blue.

[Image: memory usage graph comparing 2.1.2 (green) and 2.1.3 (blue)]
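(For reference, a minimal sketch of one way to capture such a curve; this is not necessarily the instrumentation used here, and it assumes psutil is installed and the scraper process is named "sotoki":)

    # Sketch only: sample the combined RSS of the scraper process(es) once per
    # second and append it to a CSV for graphing. psutil and the process name
    # are assumptions.
    import csv
    import time

    import psutil

    def sample_rss(process_name: str = "sotoki", outfile: str = "rss.csv") -> None:
        with open(outfile, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["timestamp", "rss_bytes"])
            while True:
                rss = sum(
                    p.info["memory_info"].rss
                    for p in psutil.process_iter(["name", "memory_info"])
                    if p.info["name"] == process_name
                )
                if rss == 0:
                    break  # no matching process left, assume the run ended
                writer.writerow([time.time(), rss])
                fh.flush()
                time.sleep(1)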

The only remark is that 2.1.3 seems to run a little bit faster, but that's probably not a big deal.

This means I did not reproduce what I saw the first time I ran 2.1.2 and observed the container processes with top.

I started 2.1.2 again with top running from inside the container, and I got the same result as in the benchmarking graph.

I started the recipe again on the same worker (athena18) on which it failed before, with the unmodified recipe, and it worked well: https://farm.openzim.org/pipeline/74c50440-841c-40bf-acd2-a75652bcc4c0

So it looks like there is some environmental factor in the expression of this issue. I will continue investigating.

@benoit74
Collaborator Author

benoit74 commented Nov 4, 2024

The problem of the first run leaking memory was reproduced with 2 successive runs of 2.1.2 on windowsphone.stackexchange.com_en on my machine:

[Image: memory usage graph of two successive 2.1.2 runs]

This confirms that:

  • the problem is not linked to 2.1.3; it was already there in 2.1.2
  • the problem is linked to some environmental factor causing a kind of memory leak (we used two different Docker containers, without any mounted volumes but with the same image, so only the environment - the web? - changes)

@rgaudin
Member

rgaudin commented Nov 4, 2024

Thank you @benoit74; investigating memory leaks with Python is difficult. From my experience, it requires extreme rigor and documentation so that apples can be compared to apples as much as possible.

You're lucky to have both a working and a leaking scenario in different images. I suggest you bisect the changes and test to find the culprit change(s). I'd start by reverting the dependencies update.

@benoit74
Collaborator Author

benoit74 commented Nov 4, 2024

I have probably nailed down the problem: in the first run, we download / resize / upload to the S3 cache many pictures. In subsequent runs, we only download the pictures from the cache.

I just ran vegetarianism.stackexchange.com again:

  • there were still (after previous Zimfarm runs) 132 images to download / resize / upload to S3
  • out of them, 49 raised a Resize Error for ...: 'Image is too small, Image size : xxx, Required size : 540' log
  • in total, 215 pictures were fetched directly from the S3 cache in the second run (and none were downloaded online)

So something is leaking memory in this async execution.
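In rough pseudocode, the flow described above looks like this (a sketch with stand-in helpers, not sotoki's real code):

    # Sketch of the described flow (placeholders, not sotoki's real code):
    # only the first run pays for download + resize + upload; subsequent runs
    # are served straight from the S3 cache.
    def get_image(url: str, cache: dict, target_size: int = 540) -> bytes:
        cached = cache.get(url)
        if cached is not None:
            return cached                    # second run: straight from the cache

        data = download(url)                 # first run: fetch from the web
        resized = resize(data, target_size)  # may raise "Image is too small"
        cache[url] = resized                 # upload to the S3 cache
        return resized

    def download(url: str) -> bytes:
        return b"raw image bytes"  # stand-in for the real HTTP download

    def resize(data: bytes, size: int) -> bytes:
        return data  # stand-in for the real resize step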

@benoit74
Collaborator Author

benoit74 commented Nov 4, 2024

What I've found so far:

  • reducing the number of image executor workers from 100 to 10 does not change much (if anything) in terms of memory consumption
  • there is no big picture to download which could allocate a big amount of memory
  • it looks like the problem happens "somewhere" in the middle of questions processing, quite close to the end (nothing very precise, more a feeling based on a few observations, to be analyzed further; one possible way to narrow it down is sketched below)
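One way to localize this further (a sketch of a local debugging aid, not something already wired into sotoki) is to periodically diff tracemalloc snapshots during question processing and look at which allocation sites keep growing:

    # Sketch: diff tracemalloc snapshots every N processed questions to see
    # which call sites keep growing. Purely a local debugging aid, not sotoki code.
    import tracemalloc

    tracemalloc.start(25)  # keep 25 frames of traceback per allocation
    previous = tracemalloc.take_snapshot()

    def report_growth() -> None:
        """Call this e.g. every few hundred processed questions."""
        global previous
        current = tracemalloc.take_snapshot()
        for stat in current.compare_to(previous, "lineno")[:10]:
            print(stat)  # top growing allocation sites since the last call
        previous = current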
