Some Celery tasks crash sometimes - MemoryError #13

Open · markkuriekkinen opened this issue Nov 10, 2020 · 0 comments

markkuriekkinen (Contributor) commented Nov 10, 2020

Some Radar Celery tasks seem to crash sometimes. In production we can see that the systemd service for the Celery worker occasionally dies and does not restart automatically. The comparison of submissions then gets stuck, since the worker is dead and must be restarted manually.

Investigate why the workers crash sometimes and fix it.

One likely cause is that the system runs out of memory at some point. The basic programming course has over 1000 students, some submission files may be larger than expected, and some students accidentally submit non-Python files, for example compiled binary .pyc files. There have been other errors related to large submission files. If Radar is unable to download submission files larger than 100 MB from A+, then we know that the comparison does not crash because of gigantic files; is 100 MB already too much, though? Is the memory error caused by the large number of students and submissions, or by large submission files? The log below shows that the worker tries to load its input data from a serialized string and then crashes with a MemoryError. If that string contains the comparison results of all students in the course, then it may grow too large with a thousand students.
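For illustration only, a minimal sketch of one way to keep the pickled task payload small (the task, module path, and model names below are hypothetical, not Radar's actual code): pass only database ids in the Celery message and load the heavy data from the database inside the worker, so the payload stays small no matter how many students the course has.

```python
from celery import shared_task


@shared_task
def compare_submission(submission_id):
    """Run the comparison for a single submission identified by its database id."""
    # Hypothetical Django model import; Radar's real app and model names may differ.
    from data.models import Submission

    submission = Submission.objects.get(pk=submission_id)
    # ... run the comparison against other submissions here ...
    # Return something small; a large result object would be pickled back
    # through the broker and could hit the same memory limits.
    return submission.pk
```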

After a worker dies with a MemoryError and fails to restart, other workers start to report connection errors (connection refused, connection reset by peer), probably because the dead worker no longer responds. Sometimes all workers die completely and sometimes only one of them needs to be restarted manually.
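One possible mitigation to investigate (a sketch, not a confirmed fix for this issue) is to let Celery recycle its pool worker processes before their memory use grows too large, using the standard worker_max_tasks_per_child and worker_max_memory_per_child settings; the app name and limits below are placeholders.

```python
from celery import Celery

app = Celery("radar")  # placeholder; Radar defines its own Celery app instance

# Recycle a pool worker after it has executed 50 tasks ...
app.conf.worker_max_tasks_per_child = 50
# ... or when its resident memory exceeds roughly 500 MB (value is in kilobytes).
app.conf.worker_max_memory_per_child = 500000
```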

MemoryError in Radar logs:

```
[2020-11-10 15:25:35,297: ERROR/ForkPoolWorker-1] Pool process <celery.concurrency.asynpool.Worker object at 0x7fe9a7b5aa58> error: MemoryError()
Nov 10 17:25:35 radar.cs.hut.fi radar_celery_main[5041]: Traceback (most recent call last):
  File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 289, in __call__
    sys.exit(self.workloop(pid=pid))
  File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 347, in workloop
    req = wait_for_job()
  File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 447, in receive
    ready, req = _receive(1.0)
  File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 419, in _recv
    return True, loads(get_payload())
  File "/srv/radar/venv/lib/python3.5/site-packages/billiard/common.py", line 107, in pickle_loads
    return load(BytesIO(s))
MemoryError
[2020-11-10 15:25:41,265: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:5050 exited with 'exitcode 1'
```
@markkuriekkinen markkuriekkinen changed the title Some Celery tasks crash sometimes Some Celery tasks crash sometimes - MemoryError Nov 18, 2020
@markkuriekkinen markkuriekkinen moved this to Todo in A+ sprints Sep 6, 2022
@PasiSa PasiSa added this to the Fall 2022 milestone Sep 6, 2022
@PasiSa PasiSa removed the status in A+ sprints Oct 3, 2023