Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python Supervisord] criu restore is successful, But the supervisord program coredump #1477

Open
throwbear opened this issue May 15, 2021 · 9 comments

Comments

@throwbear
Copy link

throwbear commented May 15, 2021

I checkpointed and restored a python supervisord program.
It seems that restore was successful, but a coredump occurred after running for a while:

At the (gdb) prompt, just run the following command:

root@XXXXXXXX:/# gdb -c supervisord.17.core --args python2.7
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python2.7...Reading symbols from /usr/lib/debug/.build-id/0d/a6b8d03dd0c23b9dc6009dd14162bc23cbfb2a.debug...done.
done.

warning: core file may not match specified executable file.
[New LWP 17]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/python /usr/bin/supervisord -c /etc/supervisor/supervisord.conf'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fb4b67fffff in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007fb4b67fffff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fb4b680142a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fb4b683dc00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007fb4b6843fc6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007fb4b6846491 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007fb4b6847f64 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000562e8e19cc5f in PyList_New () at ../Objects/listobject.c:152
#7 0x0000562e8e2021e5 in dict_values.lto_priv () at ../Objects/dictobject.c:1294
#8 0x0000562e8e1c7a6f in call_function (oparg=, pp_stack=0x7ffe17a75238) at ../Python/ceval.c:4336
#9 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
#10 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75388, func=) at ../Python/ceval.c:4437
#11 call_function (oparg=, pp_stack=0x7ffe17a75388) at ../Python/ceval.c:4372
#12 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
#13 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a754d8, func=) at ../Python/ceval.c:4437
#14 call_function (oparg=, pp_stack=0x7ffe17a754d8) at ../Python/ceval.c:4372
#15 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
#16 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75628, func=) at ../Python/ceval.c:4437
#17 call_function (oparg=, pp_stack=0x7ffe17a75628) at ../Python/ceval.c:4372
#18 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
#19 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75778, func=) at ../Python/ceval.c:4437
#20 call_function (oparg=, pp_stack=0x7ffe17a75778) at ../Python/ceval.c:4372
#21 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
#22 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a758c8, func=) at ../Python/ceval.c:4437
#23 call_function (oparg=, pp_stack=0x7ffe17a758c8) at ../Python/ceval.c:4372
#24 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
#25 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75a18, func=) at ../Python/ceval.c:4437
#26 call_function (oparg=, pp_stack=0x7ffe17a75a18) at ../Python/ceval.c:4372
#27 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
#28 0x0000562e8e1c0c35 in PyEval_EvalCodeEx () at ../Python/ceval.c:3584
#29 0x0000562e8e1c83e9 in fast_function (nk=0, na=0, n=, pp_stack=0x7ffe17a75c28, func=) at ../Python/ceval.c:4447
#30 call_function (oparg=, pp_stack=0x7ffe17a75c28) at ../Python/ceval.c:4372
#31 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
#32 0x0000562e8e1c0c35 in PyEval_EvalCodeEx () at ../Python/ceval.c:3584
#33 0x0000562e8e1c09d9 in PyEval_EvalCode (co=, globals=, locals=) at ../Python/ceval.c:669
#34 0x0000562e8e1f144f in run_mod.lto_priv () at ../Python/pythonrun.c:1376
#35 0x0000562e8e1ec262 in PyRun_FileExFlags () at ../Python/pythonrun.c:1362
#36 0x0000562e8e1ebdae in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:948
#37 0x0000562e8e19c5f1 in Py_Main () at ../Modules/main.c:640
#38 0x00007fb4b67ed2e1 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#39 0x0000562e8e19be1a in _start ()
(gdb) py-bt
Traceback (most recent call first):
File "/usr/lib/python2.7/dist-packages/supervisor/process.py", line 784, in get_dispatchers
for process in self.processes.values():
File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 141, in get_process_map
process_map.update(group.get_dispatchers())
File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 192, in runforever
combined_map.update(self.get_process_map())
File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 101, in run
self.runforever()
File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 84, in main
self.run()
File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 374, in go
d.main()
File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 364, in main
go(options)
File "/usr/bin/supervisord", line 11, in
load_entry_point('supervisor==3.3.3', 'console_scripts', 'supervisord')()

@adrianreber
Copy link
Member

Hmm, that is a tough one. If an application crashes sometimes after restore it is always hard to tell.

The only idea I currently have is that it might be related to jumps in the clock. There is a ticket at supervisord (Supervisor/supervisor#1043) where jumps in the clock lead to problems. Maybe you are seeing something similar.

Are you moving the checkpoint to another machine? Are you rebooting between checkpoints? Maybe you could try to work with a time namespaces to avoid jumps of the clock. A quick look at the source code of supervisord seems to indicate that it is using CLOCK_MONOTONIC. A time namespace could help.

It also seems you are using a really old version of supervisord. Newer version have changes to deal with clock jumps.

Maybe the clock thing can help you. Maybe it is all completely wrong.

@rst0git
Copy link
Member

rst0git commented May 15, 2021

@throwbear Does the problem occur with Python 3 as well?

@throwbear
Copy link
Author

throwbear commented May 17, 2021

Hmm, that is a tough one. If an application crashes sometimes after restore it is always hard to tell.

The only idea I currently have is that it might be related to jumps in the clock. There is a ticket at supervisord (Supervisor/supervisor#1043) where jumps in the clock lead to problems. Maybe you are seeing something similar.

Are you moving the checkpoint to another machine? Are you rebooting between checkpoints? Maybe you could try to work with a time namespaces to avoid jumps of the clock. A quick look at the source code of supervisord seems to indicate that it is using CLOCK_MONOTONIC. A time namespace could help.

It also seems you are using a really old version of supervisord. Newer version have changes to deal with clock jumps.

Maybe the clock thing can help you. Maybe it is all completely wrong.

thanks a lot.
Actually, I was trying to restore the process in another machine.
Do you mean that a time difference between two machines leads to the supervisord aborted after executing criu-restore

@adrianreber
Copy link
Member

Actually, I was trying to restore the process in another machine.

Good to know.

Do you mean that a time difference between two machines leads to the supervisord aborted after executing criu-restore

Could be. At this point I would say this is not a CRIU problem. You should at least first try the newest version of supervisord with Python 3 to see if the error goes away. If that does not help try it with a time namespace.

Anyway, I am just guessing because I saw that there used to be problems with supervisord and clocks. It still might be something completely different.

@throwbear
Copy link
Author

throwbear commented May 17, 2021

Could be. At this point I would say this is not a CRIU problem. You should at least first try the newest version of supervisord with Python 3 to see if the error goes away. If that does not help try it with a time namespace.

Anyway, I am just guessing because I saw that there used to be problems with supervisord and clocks. It still might be something completely different.

As this issue describe, subprocess started by supervisord would hang for a long time if there is a out-of-sync between machines, BUT it would not cause supervisord coredump badly.

there is a similar issue #Supervisor/supervisor#1430

criu version 3.15

@rst0git
Copy link
Member

rst0git commented May 17, 2021

@throwbear It's not supervisord coredump, its Python, but Python 2.7 has been deprecated. If you could replicate the issue with Python 3 it would be easier to debug.

Actually, I was trying to restore the process in another machine.
Do you mean that a time difference between two machines leads to the supervisord aborted after executing criu-restore

You can try to restore supervisord first on the same host. To successfully migrate supervisord between two machines you need to make sure that all libraries (e.g. libc) are exactly the same.

@throwbear
Copy link
Author

You can try to restore supervisord first on the same host. To successfully migrate supervisord between two machines you need to make sure that all libraries (e.g. libc) are exactly the same.

Thanks a lot. I have install two machine by the same linux images. AND the supervisord-coredump case DOES NOT happen every time when i execute criu-restore from one machine to another. It happened occasionally

@avagin
Copy link
Member

avagin commented May 27, 2021

What kernel do you use? Does it support time namespaces?

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants