Some Radar Celery tasks occasionally crash. In production we can see that the systemd service for the Celery worker has sometimes died and has not restarted automatically. The comparison of submissions then gets stuck, since the worker is dead and must be restarted manually.
Investigate why the workers crash sometimes and fix it.
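As a first mitigation, Celery itself can recycle worker processes before they grow too large, so one leaky or oversized comparison does not leave a dead worker behind. The sketch below uses Celery's `worker_max_tasks_per_child` and `worker_max_memory_per_child` settings; the `radar` app name and the concrete limits are assumptions for illustration, not Radar's actual configuration.

```python
# Minimal sketch, assuming a Celery 4.x app; the app name and the limits
# below are illustrative, not Radar's real configuration.
from celery import Celery

app = Celery("radar")

app.conf.update(
    # Replace a child process after it has run this many tasks, so memory
    # leaked by individual comparison runs is eventually returned to the OS.
    worker_max_tasks_per_child=50,
    # Replace a child process once its resident memory exceeds this limit
    # (given in kilobytes, i.e. roughly 800 MB here). The running task is
    # allowed to finish before the process is recycled.
    worker_max_memory_per_child=800000,
)
```

This does not fix the underlying memory growth, but it should keep a single oversized task from taking the whole worker service down.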
One likely cause is that the system runs out of memory at some point. The basic programming course has over 1000 students. Some submission files may also be larger than expected, and some students accidentally submit non-Python files, for example compiled .pyc binaries. There have been other errors about large submission files. If Radar cannot download submission files larger than 100 MB from A+, then we know the comparison does not crash because of gigantic files; is 100 MB already too much, though? Is the memory error caused by the large number of students and submissions, or by large submission files? The log below shows that the worker tries to load its input data from a serialized string and then crashes with a MemoryError. If that string contains the comparison results of all students in the course, it may simply grow too large with a thousand students.
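If the MemoryError really happens while unpickling the task payload, as the traceback below suggests, one way to shrink the payload is to send only a small identifier to the task and load the heavy comparison data inside the worker, in chunks. This is only a sketch under that assumption: `compare_exercise`, the `data.models.Comparison` import path, and the field names are hypothetical, not Radar's actual task or model API.

```python
# Hedged sketch: keep the broker message to a single integer id so the
# worker's pickle_loads never has to rebuild a whole course's results.
# All names below (compare_exercise, Comparison, ...) are hypothetical.
from celery import Celery

app = Celery("radar")

@app.task
def compare_exercise(exercise_id):
    """Compare submissions of one exercise, loading data lazily in the worker."""
    # Heavy data is read from the database inside the worker instead of
    # being serialized into the task message by the caller.
    from data.models import Comparison  # hypothetical import path

    # Iterate the queryset in chunks so only a few hundred rows are
    # resident in memory at any one time.
    queryset = Comparison.objects.filter(submission__exercise_id=exercise_id)
    for comparison in queryset.iterator(chunk_size=500):
        ...  # run or update the similarity comparison for this pair

# Caller side: only the integer id crosses the broker.
# compare_exercise.delay(exercise_id)
```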
After a worker dies with a MemoryError and fails to restart, the other workers start to report connection errors (connection refused, connection reset by peer), probably because the dead process is no longer responding. Sometimes all workers die completely, and sometimes only one of them needs to be restarted manually.
MemoryError in Radar logs:
[2020-11-10 15:25:35,297: ERROR/ForkPoolWorker-1] Pool process <celery.concurrency.asynpool.Worker object at 0x7fe9a7b5aa58> error: MemoryError()
Nov 10 17:25:35 radar.cs.hut.fi radar_celery_main[5041]: Traceback (most recent call last):
File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 289, in __call__
sys.exit(self.workloop(pid=pid))
File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 347, in workloop
req = wait_for_job()
File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 447, in receive
ready, req = _receive(1.0)
File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 419, in _recv
return True, loads(get_payload())
File "/srv/radar/venv/lib/python3.5/site-packages/billiard/common.py", line 107, in pickle_loads
return load(BytesIO(s))
MemoryError
[2020-11-10 15:25:41,265: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:5050 exited with 'exitcode 1'