Restore failed on Oracle Database Server 12c R2 #412

Open
dineshputchala opened this issue Nov 20, 2017 · 38 comments

@dineshputchala

dineshputchala commented Nov 20, 2017

Trying to verify the checkpoint/restore feature on Oracle Database Server 12c R2.

I hit a similar issue (#255) last year when I tried a non-production docker-1.10.0-dev build.

This time I tried the latest version, since checkpoint/restore is available as an experimental feature in the regular Docker release.

Steps followed are:

Enabled experimental flag on "Docker version 17.06.2-ol, build d02b7ab"

bash-4.2$ docker run -d --env-file db_env.dat -p :1521 -p :5500 --name tc --security-opt seccomp:unconfined store/oracle/database-enterprise:12.2.0.1
b1ed6b3ff854241230e357432e779238e4b0a14a32ea9b0661f87697161ac51c

Created a checkpoint once the DB came up:

bash-4.2$ docker checkpoint create tc tc_ck1
tc_ck1

bash-4.2$ docker checkpoint ls tc
CHECKPOINT NAME
tc_ck1

Tried to start the container again from the checkpoint:

bash-4.2$ docker start --checkpoint tc_ck1 tc
Error response from daemon: oci runtime error: criu failed: type NOTIFY errno 0
log file: /var/lib/docker/containers/b1ed6b3ff854241230e357432e779238e4b0a14a32ea9b0661f87697161ac51c/checkpoints/tc_ck1/criu.work/restore-2017-11-17T02:06:14.324615919-08:00/restore.log
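
The steps above can be collected into a small script. This is only a sketch: the `DOCKER` variable and the `repro_checkpoint_restore` function name are hypothetical, the container and checkpoint names are the ones from this report, and it assumes the daemon already runs with the experimental flag enabled.

```shell
#!/bin/sh
# Sketch of the reproduction steps; the helper names are illustrative only.
DOCKER="${DOCKER:-docker}"   # override for a dry run, e.g. DOCKER=echo

repro_checkpoint_restore() {
    container="$1"
    checkpoint="$2"
    # Checkpoint the running container (it is stopped unless --leave-running is given).
    "$DOCKER" checkpoint create "$container" "$checkpoint" || return 1
    # Confirm that the checkpoint is listed.
    "$DOCKER" checkpoint ls "$container" || return 1
    # Restore from the checkpoint; this is the step that failed in this report.
    "$DOCKER" start --checkpoint "$checkpoint" "$container"
}
```

Calling `repro_checkpoint_restore tc tc_ck1` runs the three commands in order; with `DOCKER=echo` it only prints them.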

@dineshputchala
Author

Attached the restore.log

restore.log

@xemul
Member

xemul commented Nov 21, 2017

Ugh, there's some problem with AIO ring. Cc @tkhai and @avagin

@tkhai
Member

tkhai commented Nov 21, 2017

1) @dineshputchala, which kernel version do you use?
2) Is this easy to reproduce in your environment?

@adrianreber
Member

Judging from the shell version, the glibc in use, and other information in the restore.log, this could be CentOS or RHEL. Which CRIU version are you using?

Strange that the CRIU version is not visible in the restore.log. We should also put the kernel version in the dump and restore log.

CRIU on CentOS/RHEL needs an extra patch if built from source: https://git.centos.org/blob/rpms!criu.git/c7/SOURCES!aio-fix.patch

@adrianreber
Member

My tests with migrating the Oracle database have always failed, probably due to problems with monotonic time.

@adrianreber
Member

Or better: Migration works, but the database shuts down after migration.

@tkhai
Member

tkhai commented Nov 21, 2017

> CRIU on CentOS/RHEL needs an extra patch if built from source: https://git.centos.org/blob/rpms!criu.git/c7/SOURCES!aio-fix.patch

That is because the formula in the kernel has changed.
I submitted patches to fix that a year ago: https://marc.info/?l=openvz-criu&m=146366354304999&w=2

We do not support old kernels as Pasha said:
https://marc.info/?l=openvz-criu&m=146373758226363&w=2

So, if the kernel is really old, we do not support it.

@dineshputchala
Author

@tkhai @adrianreber @avagin

Docker host details:

bash-4.2$ docker -v
Docker version 17.06.2-ol, build d02b7ab

bash-4.2$ uname -a
Linux slc12moz 4.1.12-61.1.27.el7uek.x86_64 #2 SMP Fri Feb 3 12:31:56 PST 2017 x86_64 x86_64 x86_64 GNU/Linux

bash-4.2$ cat /etc/oracle-release
Oracle Linux Server release 7.3

OS is Oracle Linux 7.3

CRIU version on docker host: criu-2.12-2.el7.x86_64

It's easy to reproduce: just start the DB container, checkpoint it, and try to restore from the checkpoint. It fails every time.
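
Since the maintainers had to ask for the kernel and CRIU versions, it may help to gather them in one go when reporting. A minimal sketch follows; the function name `collect_cr_env` is hypothetical, and the release file path is simply the one shown in this comment.

```shell
#!/bin/sh
# Print the environment details requested in this thread.
collect_cr_env() {
    echo "kernel: $(uname -r)"
    if command -v criu >/dev/null 2>&1; then
        # criu --version prints a "Version: ..." line first.
        echo "criu: $(criu --version | head -n 1)"
    else
        echo "criu: not installed"
    fi
    if command -v docker >/dev/null 2>&1; then
        echo "docker: $(docker -v)"
    else
        echo "docker: not installed"
    fi
    # Distribution release file; other distributions use different paths.
    if [ -r /etc/oracle-release ]; then
        echo "os: $(cat /etc/oracle-release)"
    fi
}
```

Running `criu check` as root can additionally report whether the running kernel provides the features CRIU needs.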

@avagin
Member

avagin commented Nov 22, 2017

@dineshputchala could you try the same with criu 3.6?

@dineshputchala
Author

@avagin The link you shared mentions bug #296, which is still open. That bug seems to be about a new feature needed in CRIU for the Oracle DB restore issue. Are you saying this bug is the same issue?

@avagin
Member

avagin commented Nov 22, 2017

In this bug you hit a different issue, but it is very likely that #296 will be the next one.

@adrianreber
Member

So this is interesting. If Oracle Linux ships the RHEL criu package, with the special AIO patch I added for RHEL, on a newer kernel, it will not work. @dineshputchala, you need to talk to your vendor and tell them that their criu package is wrong.

@dineshputchala
Author

dineshputchala commented Nov 27, 2017

Installed CRIU 3.6 on my machine by building it from source, as the package was not available in my repos. It took some time to resolve the many build dependencies, and it was not easy or straightforward!

Hurray... I finally got CRIU 3.6 installed!

@dineshputchala
Author

@avagin Attempted the checkpoint/restore experiment again on Oracle Database Server 12c R2 with the latest CRIU version (3.6).

This time it's a different story...

The checkpoint succeeded and the restore did not throw any error, but the DB inside the container did not come up successfully.

bash-4.2$ docker checkpoint create cont_criu3 cont_criu3_chk
cont_criu3_chk

bash-4.2$ docker checkpoint ls cont_criu3
CHECKPOINT NAME
cont_criu3_chk

bash-4.2$ docker start --checkpoint cont_criu3_chk cont_criu3
bash-4.2$

I checked the alert log and found the errors and warnings below:

Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process

Warning: VKTM detected a forward time drift.
Warning: 52 processes are still attach to shmid 98307:
Warning: 51 processes are still attach to shmid 98307:
...
...

This seems to be the same issue as observed in #296.
Alert log attached as well; please check.

alert.log

@adrianreber
Member

@dineshputchala nice. Now you need to ask Oracle to support migration. The Oracle database seems to have problems if the time changes. This is expected, as time keeps running while your container is stopped. It is even worse for migration, as the kernel timers on the destination system will be completely different. So this is unrelated to CRIU and needs to be changed in the database.

A time namespace in the kernel could be a solution to handle this, but it would first need to be implemented in the kernel.
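
To illustrate why a checkpointed process sees a forward jump, here is a rough sketch of a timekeeper that compares two samples of a monotonic-style clock against the interval it expected between them. VKTM's real detection logic is not described in this thread, so the function names, the use of `/proc/uptime`, and the tolerance parameter are all assumptions made for illustration.

```shell
#!/bin/sh
# Time since boot in milliseconds, read from /proc/uptime (Linux only).
uptime_ms() {
    awk '{ printf "%.0f\n", $1 * 1000 }' /proc/uptime
}

# Print the forward drift in ms between two samples that were expected to be
# interval_ms apart; print 0 when the excess is within tolerance_ms.
forward_drift_ms() {
    prev_ms="$1"
    now_ms="$2"
    interval_ms="$3"
    tolerance_ms="$4"
    drift=$(( now_ms - prev_ms - interval_ms ))
    if [ "$drift" -gt "$tolerance_ms" ]; then
        echo "$drift"
    else
        echo 0
    fi
}
```

If the process is frozen for a minute between two samples that should be 10 ms apart, `forward_drift_ms 1000 61010 10 5` reports 60000 ms of drift, whereas an undisturbed pair such as `forward_drift_ms 1000 1010 10 5` reports 0.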

@dineshputchala
Author

A similar issue is observed in bug #296, which requires changes in the kernel and then in CRIU.

@dineshputchala
Author

From the alert log:

Warning: VKTM detected a forward time drift.
Time drifts can result in unexpected behavior such as time-outs.
Please see the VKTM trace file for more details:
/u01/app/oracle/diag/rdbms/orclcdb/ORCLCDB/trace/ORCLCDB_vktm_64.trc
2017-11-27T06:43:17.786116+00:00
PMON (ospid: 58): terminating the instance due to error 472

The same issue is observed in bug #296, which requires changes in the kernel and then in CRIU.

@dineshputchala
Author

dineshputchala commented Sep 3, 2018

Any update on the time namespace implementation in the kernel and CRIU?

@tkhai
Member

tkhai commented Sep 3, 2018

Andrey can speak better about the CRIU status, since he is digging into this at the moment. But I want to touch on another direction. @dineshputchala, have you tried asking Oracle to work around this issue in the meantime, until we have a solution in the kernel and CRIU?

@dineshputchala
Author

dineshputchala commented Sep 3, 2018

@adrianreber @avagin Any update on the implementation of the time namespace?

@avagin
Member

avagin commented Sep 5, 2018

@dineshputchala
Author

Any update on the time namespace implementation in the kernel?

Any update on the CRIU changes to support it?

@avagin
Member

avagin commented Jan 11, 2019

@dineshputchala We sent the RFC version:
https://lkml.org/lkml/2018/9/19/950

Then we discussed it at LPC:
https://www.youtube.com/watch?v=sjRUiqJVzOA&t=93s

And now we are working on the second version of these patches. We are going to post them this month.

@dineshputchala
Author

@avagin After the kernel changes, CRIU also needs changes to use this feature, right?

@avagin
Member

avagin commented Feb 21, 2019

@dineshputchala yes, we will need to add some code in CRIU to support time namespaces. But this should not be hard.

@dineshputchala
Author

@avagin Which kernel version has support for time namespaces? Is CRIU support for time namespaces done?

@rst0git
Member

rst0git commented Aug 1, 2019

@dineshputchala
Author

@rst0git Where can I check which kernel version has picked up these changes?
@avagin Is CRIU support for time namespaces also done?

@rst0git
Member

rst0git commented Aug 12, 2019

@dineshputchala the patch series for time namespace is not merged upstream yet. The link above is to the latest version of this patch series.

@dineshputchala
Author

@rst0git Is time namespace support available in Oracle Linux 7.x? Which kernel versions have this support?
@avagin Which version of CRIU has time namespace support? Is it in version 3.12?

@avagin
Member

avagin commented Oct 31, 2020

@dineshputchala Time namespaces have been merged in v5.6, but the current implementation doesn't fix this issue. We need to save/restore start_time for processes to fix this issue.
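
For context, the per-process start time mentioned here is visible from userspace as field 22 (`starttime`, in clock ticks since boot) of `/proc/<pid>/stat`. The sketch below only shows how to read that field; the function name is hypothetical, and how Oracle consumes the value or how CRIU would restore it is not stated in this thread.

```shell
#!/bin/sh
# Extract starttime (field 22) from one line of /proc/<pid>/stat.
starttime_from_stat() {
    line="$1"
    # Field 2 (comm) is parenthesised and may contain spaces, so drop
    # everything up to the last ") " before splitting on whitespace.
    rest="${line##*') '}"
    # rest now begins at field 3 (state), so field 22 is its 20th word.
    set -- $rest
    echo "${20}"
}
```

On a live system, `starttime_from_stat "$(cat /proc/self/stat)"` prints the value for the reading process, and dividing it by `getconf CLK_TCK` converts it to seconds since boot. Because the value is relative to boot, it would differ on a destination host after migration.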

@Snorch
Member

Snorch commented Dec 7, 2020

@avagin @dineshputchala Just in case it helps someone save/restore start_time for processes:

Here is how we do it in Virtuozzo criu+kernel (sadly without time namespaces yet).

criu patch:
https://src.openvz.org/projects/OVZ/repos/criu/commits/0890b5bb810411b43cb4fbc4abf615ce4852a17f
kernel patch (prctl to restore start time):
https://src.openvz.org/projects/OVZ/repos/vzkernel/commits/5523457019b76e044d1df120ac2879094520f232#kernel/sys.c

I am not saying this is the right way; I understand that the time namespace approach is the right one, but I hope it helps.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@dineshputchala
Author

@avagin Is the time namespace feature completely implemented in the kernel? If so, which version of Oracle Linux 7.x or Oracle Linux 8 has it?

@adrianreber
Member

@dineshputchala I am pretty sure nobody here knows which OracleLinux kernel has which feature. CRIU's CI is enabling time namespace tests on anything >= 5.11. So you need to figure out if OracleLinux has all the time namespace patches from 5.11.
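
One way to check a given host is sketched below, assuming the 5.11 threshold mentioned above. The function names are hypothetical. A version comparison is only a weak signal on distribution kernels, which may backport features, so the presence of `/proc/self/ns/time` is the more direct evidence.

```shell
#!/bin/sh
# Succeed if a kernel release string is at least want_major.want_minor.
kernel_at_least() {
    release="$1"
    want_major="$2"
    want_minor="$3"
    major="${release%%.*}"
    minor="${release#*.}"
    minor="${minor%%[!0-9]*}"   # keep only the leading digits of the minor part
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Succeed if the running kernel exposes a time namespace for this process.
has_time_ns() {
    [ -e /proc/self/ns/time ]
}
```

For example, `kernel_at_least "$(uname -r)" 5 11 && has_time_ns` succeeds only when both checks pass; the 4.1.12 UEK kernel reported earlier in this thread would fail the first one.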

@rst0git
Member

rst0git commented Dec 8, 2021

Oracle Linux 8.4 maybe?

Oracle Linux 8 includes the time namespace.

https://docs.oracle.com/en/operating-systems/oracle-linux/8/relnotes8.4/ol8-features-changes.html

@dineshputchala
Author

There is a CRIU bug on Oracle Linux 8.4, and it looks like it needs to be fixed in OL8.

#1683
