Some Radar Celery tasks occasionally crash. In production we can see that the systemd service for the Celery worker has sometimes died and has not restarted automatically. The comparison of submissions then gets stuck, since the worker is dead and must be restarted manually.
Investigate why the workers crash sometimes and fix it.
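As a first mitigation, Celery itself can recycle worker processes before they grow too large, so one leaky or oversized comparison does not leave a dead worker behind. The sketch below uses Celery's `worker_max_tasks_per_child` and `worker_max_memory_per_child` settings; the `radar` app name and the concrete limits are assumptions for illustration, not Radar's actual configuration.

```python
# Minimal sketch, assuming a Celery 4.x app; the app name and the limits
# below are illustrative, not Radar's real configuration.
from celery import Celery

app = Celery("radar")

app.conf.update(
    # Replace a child process after it has run this many tasks, so memory
    # leaked by individual comparison runs is eventually returned to the OS.
    worker_max_tasks_per_child=50,
    # Replace a child process once its resident memory exceeds this limit
    # (given in kilobytes, i.e. roughly 800 MB here). The running task is
    # allowed to finish before the process is recycled.
    worker_max_memory_per_child=800000,
)
```

This does not fix the underlying memory growth, but it should keep a single oversized task from taking the whole worker service down.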
One likely cause is that the system runs out of memory at some point. The basic programming course has over 1000 students. Some submission files may also be larger than expected, and some students accidentally submit non-Python files, for example compiled .pyc binaries. There have been other errors about large submission files. If Radar cannot download submission files larger than 100 MB from A+, then we know the comparison does not crash because of gigantic files; is 100 MB already too much, though? Is the memory error caused by the large number of students and submissions, or by large submission files? The log below shows that the worker tries to load its input data from a serialized string and then crashes with a MemoryError. If that string contains the comparison results of all students in the course, it may simply grow too large with a thousand students.
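If the MemoryError really happens while unpickling the task payload, as the traceback below suggests, one way to shrink the payload is to send only a small identifier to the task and load the heavy comparison data inside the worker, in chunks. This is only a sketch under that assumption: `compare_exercise`, the `data.models.Comparison` import path, and the field names are hypothetical, not Radar's actual task or model API.

```python
# Hedged sketch: keep the broker message to a single integer id so the
# worker's pickle_loads never has to rebuild a whole course's results.
# All names below (compare_exercise, Comparison, ...) are hypothetical.
from celery import Celery

app = Celery("radar")

@app.task
def compare_exercise(exercise_id):
    """Compare submissions of one exercise, loading data lazily in the worker."""
    # Heavy data is read from the database inside the worker instead of
    # being serialized into the task message by the caller.
    from data.models import Comparison  # hypothetical import path

    # Iterate the queryset in chunks so only a few hundred rows are
    # resident in memory at any one time.
    queryset = Comparison.objects.filter(submission__exercise_id=exercise_id)
    for comparison in queryset.iterator(chunk_size=500):
        ...  # run or update the similarity comparison for this pair

# Caller side: only the integer id crosses the broker.
# compare_exercise.delay(exercise_id)
```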
After a worker dies with a MemoryError and fails to restart, the other workers start to report connection errors (connection refused, connection reset by peer), probably because the dead process is no longer responding. Sometimes all workers die completely, and sometimes only one of them needs to be restarted manually.
MemoryError in Radar logs:
[2020-11-10 15:25:35,297: ERROR/ForkPoolWorker-1] Pool process <celery.concurrency.asynpool.Worker object at 0x7fe9a7b5aa58> error: MemoryError()
Nov 10 17:25:35 radar.cs.hut.fi radar_celery_main[5041]: Traceback (most recent call last):
File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 289, in __call__
sys.exit(self.workloop(pid=pid))
File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 347, in workloop
req = wait_for_job()
File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 447, in receive
ready, req = _receive(1.0)
File "/srv/radar/venv/lib/python3.5/site-packages/billiard/pool.py", line 419, in _recv
return True, loads(get_payload())
File "/srv/radar/venv/lib/python3.5/site-packages/billiard/common.py", line 107, in pickle_loads
return load(BytesIO(s))
MemoryError
[2020-11-10 15:25:41,265: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:5050 exited with 'exitcode 1'